Monday, November 11, 2013

The Half-Empty Promise of Big Data

Anybody who works in a high-tech industry or in product development knows that “big data” is one of today’s hottest buzzwords. As I sat through numerous panel discussions and education sessions at a technology conference last week, “data mining,” “big data,” or “analytics” worked their way into almost every talk, no matter the subject. The typical context was that “big data is the key to solving this problem,” or “data mining is the new Holy Grail in our industry.” In many cases, however, these expectations are probably unrealistic.

Data mining is the analysis of large quantities of data to extract previously unknown, interesting patterns and dependencies. The conventional thinking is that we can now mine the mountains of data generated in our increasingly connected world, piece together important relationships and amazing connections that we never knew existed, and change the world. Or so they say. As a number cruncher and data geek, I see the potential.

But I also see a big pitfall in this thinking that is rarely talked about. Quite simply, most data being collected is not that useful. It’s mundane personal data (locations, what websites a person visited, what purchases they made), and mining large sets of mundane data will produce rather humdrum results at best. As an example, consider the shopper club cards used by most grocery chains. After analyzing billions of purchases by millions of people over the past two decades, what have the stores learned? That some people buy Pop Tarts in addition to flashlight batteries before a big storm. Interesting? Perhaps. World changing? Hardly.

The best applications of big data that I’m familiar with do not entail data mining at all. “Data mining” has become a buzzword that gets misused to mean any type of large-scale data analysis. For example, predicting flu outbreaks in real time based on citywide emergency room visits is a great use of big data that has both health and economic benefits. But it’s not data mining. It’s just plain old data analysis, with certain input variables that are predictive of an output variable. There is nothing unlikely or surprising about these relationships; what has changed is that we can now query data in real time from sources that were previously unavailable or unconnected.

There are many great uses for big data, but people in high-tech industries and product development have formed unrealistic expectations, particularly for data mining. Most data being collected is simply not that useful, and this leads to the half-empty promise of big data.