This is the seventh instalment in the series of essays on Statistics Denial by Randy Bartlett, Ph.D. To read other articles in the series, click here.
Myth #3: Data mining, machine learning, big data analysis, business analytics, and data science are distinct from statistics
Repackaging statistics with complementary fields has the potential to create new synergies. For example, econometrics is the marriage of economics and statistics. This repackaging has been extremely successful, surpassing Six Sigma’s mixed results, which we discussed in Blog 2. Econometrics has embraced statistics. Applied econometricians have helped develop best practice, and some identify as applied statisticians. They are with the science.
Straddling Distinct Applications:
The interest of the ‘promotional industrial complex’ is to sell things: books, magazines, conferences, workshops, new degree programs, software, advertisement space, and newly anointed ‘experts.’ New things sell better. Promotional interests are not married to protecting the integrity of statistics or best practice.
The IT part of the promotional industrial complex has started using the terms Data Mining, Machine Learning, and Data Science to include both data analysis AND data management. In the field, we are problem-oriented, not tool-oriented. This makes repackaging data analysis with data management comparable to packaging addition problems with sorting problems and calling it ‘Add-Sort Science.’ In some circles, the point of this repackaging is less about finding synergies and more about expanding IT, giving IT more missions. The next unbelievably bad idea being shopped around is that data analysis is somehow a data management problem?! If it were, then we should be better at applying statistics to reporting and data collection.
Data analysis and data management are distinct applications, separated by differences in culture, software, objectives, and thinking. Data management emphasizes efficiency in storing and accessing data, and statistics is about extracting information in the presence of uncertainty.
Any repackaging of data analysis with data management that removes statistics expertise from the data analysis is a bad idea. However, bad ideas can happen, even linger. A popular trend in the 1960s was for corporations to merge into conglomerates, which provided no synergies and made no economic sense. Without the mergers, shareholders could invest in each company separately and realize the same return. Even so, these conglomerates continued for a decade, and these types of mergers still happen. The merger of data analysis with data management does not have to make sense, and it can last a long time without making sense.
In the case of Machine Learning, academia emphasizes how this set of tools works: their facility for iterative learning. In the field, we have a problem-based view. Any Machine Learning tool that solves statistics problems will necessarily make statistics assumptions and require statistical thinking. These tools are part of statistics or Statistical Machine Learning. We provide a clarifying problem-based view of statistics in the May/June 2015 issue of Analytics Magazine, http://goo.gl/Wod3gk.
The Venn diagram in Figure 1 illustrates two areas of application for Machine Learning: data analysis and data management. From an applied perspective, there is no overlap.
The same relationship holds for Data Mining and Data Science. In the field, we should be problem-based, and that means splitting these problems:
Machine Learning = Statistical ML + IT ML,
Data Mining = Statistical DM + IT DM, and
Data Science = Statistical DS + IT DS.
What Is Wanted:
We want to keep the statistics expertise in the data analysis. This means embracing specialization, even if this does not help the promotional industrial complex to sell things. Large corporations need separate teams for data analysis and data management. Business managers should be playing chess, not checkers.
Consumers of data analysis should look to statistical certifications, such as the PSTAT, to ensure that statistical qualifications are brought to bear.
There is a flood of statistical malfeasance on its way. Wise consumers of data analysis want to avoid removing statistics expertise from their data analysis.
Repackaging data analysis/statistics with data management/IT will not provide further synergies in the field. It will just sell things.
In the field, we are problem-based. We want to split Data Science, Data Mining, and Machine Learning to match our business problems: Statistical DS, DM, & ML and IT DS, DM, & ML.
We sure could use Deming right now. Many of us who consume or produce data analysis hang out in the new LinkedIn group, About Data Analysis. Come see us.