Data professionals use data science tools, languages, and technologies to derive insights from data. A recent survey by Business Broadway indicated that data scientists use approximately four data science tools.
This article looks at which data science tools are used most and which tools are used together. Another Business Broadway study indicated that data professionals tend to use some tools together and others not. To validate those answers, Kaggle's State of Data Science and Machine Learning survey collected responses from over 16,000 data professionals on different data science practices, including their use of 48 tools, technologies, and languages.
Dimension Reduction by Principal Components Analysis
A study was undertaken to determine which tools are frequently used together, grouping the tools by the relationships among them. The method used, principal components analysis, examines the covariance and other statistical relationships within a set of variables, with the aim of explaining their correlations using a smaller number of variables (components).
The analysis was run on the pattern of relationships among the 48 tools. Deciding how many components best describe the data involves an element of human judgment, so the number of components was chosen based on the results. The objective was to explain the relationships among the 48 tools using as few components as possible.
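As a rough illustration of the technique described above, here is a minimal Python sketch of principal components analysis on tool-usage data. The responses are synthetic and invented for this example (they are not the survey data), and the function names are hypothetical. Power iteration with deflation is used only to keep the sketch free of dependencies; in practice a library routine such as numpy.linalg.eigh would be the usual choice.

```python
from itertools import product
from math import sqrt

TOOLS = ["Python", "Jupyter notebook", "TensorFlow",
         "SAS Base", "SAS Enterprise Miner"]

# Synthetic usage patterns (1 = respondent uses the tool) and how many
# respondents share each pattern. These numbers are made up for illustration.
OPEN_PATTERNS = {(1, 1, 1): 3, (0, 0, 0): 4, (1, 1, 0): 1, (0, 1, 1): 1, (1, 0, 1): 1}
SAS_PATTERNS = {(1, 1): 4, (0, 0): 4, (1, 0): 1, (0, 1): 1}


def build_responses():
    """Combine every open-source pattern with every SAS pattern."""
    rows = []
    for (a, na), (b, nb) in product(OPEN_PATTERNS.items(), SAS_PATTERNS.items()):
        rows.extend([a + b] * (na * nb))
    return rows


def correlation_matrix(rows):
    """Pearson correlations between the tool-usage columns."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]

    def cov(i, j):
        return sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n

    sd = [sqrt(cov(j, j)) for j in range(k)]
    return [[cov(i, j) / (sd[i] * sd[j]) for j in range(k)] for i in range(k)]


def leading_component(matrix, iterations=500):
    """Power iteration: returns (eigenvalue, unit eigenvector)."""
    k = len(matrix)
    v = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue, v


def principal_components(matrix, count):
    """Extract `count` components, deflating the matrix after each one."""
    k = len(matrix)
    work = [row[:] for row in matrix]
    components = []
    for _ in range(count):
        eigenvalue, v = leading_component(work)
        components.append((eigenvalue, v))
        for i in range(k):
            for j in range(k):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return components


if __name__ == "__main__":
    corr = correlation_matrix(build_responses())
    for number, (eigenvalue, vector) in enumerate(principal_components(corr, 2), 1):
        print(f"Component {number} (eigenvalue {eigenvalue:.2f})")
        for tool, weight in zip(TOOLS, vector):
            print(f"  {tool:22s} {weight:+.2f}")
```

With this synthetic data, the first component weights only the three open source tools and the second weights only the two SAS tools, which is the kind of pattern that lets a handful of components stand in for many individual tools.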
Tool Groupings
The results suggested that 14 components described the data. Tools within the same grouping tend to be used together. The groupings include:
- SAP BusinessObjects Predictive Analytics, SAS JMP
- Julia, Stan
- QlikView, TIBCO Spotfire
- Angoss, Salford Systems
- Java, Perl, Oracle R Enterprise/Oracle Data Mining
- C++/C, Mathematica, MATLAB
- IBM SPSS Statistics, Minitab
- Google Cloud Compute, Amazon ML, Amazon Web Services
- IBM Cognos, IBM SPSS Modeler, IBM Watson
- RapidMiner (commercial, free), KNIME (commercial, free), Orange
- SAS Enterprise Miner, SAS Base
- Spark/MLlib, Flume, Hive/Pig/Hadoop, Impala, Cloudera
- TensorFlow, awk, R, Python, Jupyter notebook
Four data science tools did not load clearly on a single component in the analysis: NoSQL, Tableau, DataRobot, and Statistica (Dell/Quest, previously StatSoft).
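One common way to turn component loadings into groupings like the ones listed above is to assign each tool to the component on which it loads most strongly, and to leave it ungrouped when no loading is strong enough. The sketch below assumes that approach; the 0.4 threshold, the function name, and the loading values are all hypothetical and are not taken from the study.

```python
def group_by_loading(loadings, threshold=0.4):
    """Assign each tool to its strongest component, if strong enough.

    `loadings` maps a tool name to a list of loadings, one per component.
    The 0.4 default threshold is an assumption for illustration only.
    Returns (groups, ungrouped): groups maps a component index to the
    tools assigned to it; ungrouped lists tools below the threshold.
    """
    groups, ungrouped = {}, []
    for tool, values in loadings.items():
        best = max(range(len(values)), key=lambda c: abs(values[c]))
        if abs(values[best]) >= threshold:
            groups.setdefault(best, []).append(tool)
        else:
            ungrouped.append(tool)
    return groups, ungrouped


# Made-up loadings on two components, for illustration only.
example_loadings = {
    "SAS Base": [0.71, 0.05],
    "SAS Enterprise Miner": [0.68, 0.02],
    "Julia": [0.03, 0.62],
    "Stan": [0.08, 0.59],
    "Tableau": [0.21, 0.18],
}

if __name__ == "__main__":
    groups, ungrouped = group_by_loading(example_loadings)
    for component, tools in groups.items():
        print(f"Component {component + 1}: {', '.join(tools)}")
    print(f"Not assigned to a single component: {', '.join(ungrouped)}")
```

In this made-up example the two SAS tools share one component, Julia and Stan share another, and Tableau stays ungrouped because none of its loadings reach the threshold.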
In data science, technologies, languages, and tools tend to be used together in groups. Based on how they are used, the 48 tools can be categorized into subgroups. Groupings by brand were common: professionals tended to use tools from the same vendor, such as Amazon, SAS, Microsoft, and IBM.
Other findings were counterintuitive. For instance, IBM SPSS Statistics was commonly used with Minitab and not with the other IBM tools. Likewise, SAS JMP was linked to SAP BusinessObjects and not to other SAS tools.
The use of Python was found to be closely related to the use of Jupyter notebooks, whereas R was only weakly associated with those two. In fact, R is commonly associated with tools like SAS and MS.
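To make "closely related" versus "weakly associated" concrete, one standard measure for two yes/no variables is the phi coefficient, which is the Pearson correlation computed from a 2x2 table of counts. The sketch below is only an illustration: the function name is hypothetical, the counts are invented and are not the survey's numbers, and the study itself may have used a different measure.

```python
from math import sqrt


def phi_coefficient(both, only_a, only_b, neither):
    """Phi coefficient for two binary variables from 2x2 counts.

    both    : respondents who use tool A and tool B
    only_a  : respondents who use A but not B
    only_b  : respondents who use B but not A
    neither : respondents who use neither tool
    """
    uses_a = both + only_a
    uses_b = both + only_b
    not_a = only_b + neither
    not_b = only_a + neither
    denominator = sqrt(uses_a * not_a * uses_b * not_b)
    if denominator == 0:
        raise ValueError("phi is undefined when a tool is used by all or none")
    return (both * neither - only_a * only_b) / denominator


if __name__ == "__main__":
    # Invented counts for 100 respondents, for illustration only.
    strong = phi_coefficient(both=40, only_a=10, only_b=10, neither=40)
    weak = phi_coefficient(both=30, only_a=20, only_b=20, neither=30)
    print(f"strongly paired tools: phi = {strong:.2f}")
    print(f"weakly paired tools:   phi = {weak:.2f}")
```

A value near 1 means the two tools are almost always used together or not at all, while a value near 0 means that knowing someone uses one tool says little about whether they use the other.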
For data science vendors like Microsoft, SAS, and Amazon, the results were straightforward. Companies that try to increase revenue by cross-selling their data tools may find it challenging to attract new customers, because data professionals from different coding institutions who use open source tools like Python tend to use more open source tools.
It would be useful to better understand the difference between data professionals who use open source tools and those who avoid them. For example, data professionals working at smaller companies such as startups may be more likely to use open source tools, which are readily available.
Conclusion
If data scientists want to improve their chances of success, they need to select the right tools. No single tool can do it all on its own; any project requires a set of tools. Identifying the tool sets that other data professionals use can help you narrow down the tools you should use for your own project.
By Rick Delgado