This is the tenth instalment in the series of essays on Statistics Denial by Randy Bartlett, Ph.D. To read other articles in the series, click here.
Myth #7: Big Data Volume (or Large N) contains complete information
Myth #8: Big Data Volume (or Large N) speaks for itself
Myth #9: Big Data Volume (or Large N) replaces sampling and other statistics—so much information
The first misunderstanding about Big Data is that there is a data rush. To be more accurate, we are in an information rush offering mind-boggling potential. To benefit the most, we need to filter the misleading promotional hype surrounding something that has been around since astronomers began mapping the universe: Big Data.
We will define Big Data as follows: ‘Big Data is reached at the edge of our capabilities to manage (IT) or analyze (statistics) the Volume, Velocity, and Variety of the data.’ At that edge, one of the Vs becomes part of the problem. Other experts, such as Diego Kuonen, include a fourth V, Veracity. The information inside the data has its own Vs, and this information is what matters. In this blog, we will debunk three myths concerning the Volume (or Large N) aspect of Big Data.
Myth: Big Data Contains Complete Information:
According to promotional hype, Large N/Big Data Volume contains complete information … not so. Many seek the rapture of making definitive proclamations in a deterministic universe. However, uncertainty complicates matters.
Analyzable data has two dimensions: variables and observations. Increasing the number of variables or observations increases the space for storing information; it does not guarantee complete information, or even more information. Even if we have all of the variables and all of the observations, we can expect uncertainty with the numbers.
Four common sources of uncertainty (inferential, missing values, measurement error, and surrogate variables) are explained in the May/June 2015 issue of Analytics Magazine: http://goo.gl/Wod3gk
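To make this concrete, here is a minimal simulation sketch (in Python, with hypothetical values) of why volume alone cannot deliver complete information: a larger N shrinks inferential (sampling) uncertainty, but it does nothing about a systematic measurement error.

```python
import random

random.seed(0)

TRUE_VALUE = 10.0
BIAS = 0.5  # hypothetical systematic measurement error

def biased_mean(n):
    """Average of n measurements carrying both random noise and a fixed bias."""
    return sum(TRUE_VALUE + BIAS + random.gauss(0, 1) for _ in range(n)) / n

small_n = biased_mean(100)
large_n = biased_mean(1_000_000)

# The million-observation average is far more stable (inferential
# uncertainty shrinks like 1/sqrt(N)), yet both estimates still miss
# the true value by roughly the 0.5 bias.  More volume, same distortion.
```

In other words, piling on observations narrows the confidence interval around the wrong answer; only better measurement, or a statistical correction, addresses the bias.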
The value of Big Data is its information content, which will not be complete.
Myth: Big Data Speaks For Itself:
Promotional hype has proclaimed that Large N (Big Data: Volume) allows data to “speak for itself,” without the intermediation of a priori assumptions. The hype is that bigness facilitates self-explanatory “visualizations,” which require no interpretive interventions by experts, whose views, it is claimed, are “biased” by pre-conceived or overly theoretical notions about how the world works. In fact, we have seen many examples of these visualizations, and they are usually more biased, confusing, misleading, and uninformative than graphs generated in times of old, i.e., BBDH (Before Big Data Hype).
All data requires interpretation, domain knowledge, and an understanding of the underlying assumptions. The value of Big Data is its information content, which does not speak for itself.
Myth: Big Data Replaces Statistics:
During the height of the Big Data hype, some statistics deniers gleefully proclaimed that the advent of Big Data spelled the end of statistics and statisticians, and by implication everyone using statistics to analyze data—econometricians, sociologists, physicists, and other quants. This ‘Amish view’ holds that we no longer need statistics because we have Large N/Big Data … and it is advanced by ‘Amish visionaries’ with no expertise in statistics.
We rely upon statistics to address uncertainty with the numbers. More data may contain more information, yet this will not resolve the uncertainty. The first step in analyzing Large N/Big Data is to use statistics (science) to reduce the data without losing information. This is a good trick, and it facilitates the next step: using statistics to extract information from the data—as always.
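As an illustrative sketch (Python, with made-up numbers), a modest simple random sample from a large column recovers essentially the same estimate as a full pass over the data. This is the sense in which statistics reduces the data without losing information:

```python
import random

random.seed(1)

# Stand-in for a "Large N" column: one million observations.
population = [random.gauss(50, 5) for _ in range(1_000_000)]
full_mean = sum(population) / len(population)

# A 10,000-row simple random sample: one percent of the data.
sample = random.sample(population, 10_000)
sample_mean = sum(sample) / len(sample)

# The sample mean lands within a few standard errors of the full-data
# mean, at roughly one hundredth of the processing cost.
```

The same sampling logic is what makes the large-N myth backwards: far from replacing statistics, Big Data Volume makes principled reduction and inference more valuable, not less.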
The value of Big Data is its information content, which requires statistics to extract it.
We are in an information rush that has the potential to accelerate almost every aspect of the human endeavor. To benefit the most, we need to filter the misleading promotional hype regarding Big Data.
The value of Big Data is its information content, which will not be complete; does not speak for itself; and requires statistics to extract it. Statistics addresses the uncertainty in all data.
We sure could use Deming right now. Many of us who embrace the explicit, rigorous logic and protocols of these tenets of data analysis hang out in the new LinkedIn group, About Data Analysis. Come see us.