Hurdles to Using Big Data
Last year, Gartner, which charts new trends and emerging technologies via its “Hype Cycle” tracking, decided to stop tracking big data. The reason? Big data, they said, has become so ubiquitous and so embedded in other technologies that it is no longer a thing of its own.
And yet, how many of us in marketing and market research have started working with and truly analyzing big data? Not many. Most of us have only a vague feel for what it even is. Is Census data big data? I don’t think so. Is credit card transaction data big data? Not really. Many of us have been working with “big” data sets like these for many years. Big data is something quite different.
My sense is that the reason most of us are not yet working with big data (as ubiquitous as it may be) is that it’s still too hard and too unfamiliar to work with. The tools we need, both technological and intellectual, just aren’t yet ready for prime time.
I was reminded of this as I read the summary report from a recent meeting on The Future of the Statistical Sciences, convened in London by the American Statistical Association, the Royal Statistical Society, and four other leading statistical organizations. Big data, the report said, is “undoubtedly the greatest challenge and opportunity that confronts today’s statisticians.”
Wait a minute—challenge? Yes. Even the statisticians are looking at all this big data and not knowing quite what to do with it. Quoting their report, here are the challenges (not even close to being solved!) confronting all of us in research, including our statistician brethren:
- Problems of scale. Many popular algorithms for statistical analysis do not scale up very well and run hopelessly slowly on terabyte-scale data sets. Statisticians either need to improve the algorithms or design new ones that trade off theoretical accuracy for speed.
- Different kinds of data. Big Data are not only big, they are complex and they come in different forms from what statisticians are used to, for instance images or networks.
- The “look-everywhere effect.” As scientists move from a hypothesis-driven to a data-driven approach, the number of spurious findings (e.g., genes that appear to be connected to a disease but really aren’t) is guaranteed to increase, unless specific precautions are taken.
- Privacy and confidentiality. This is probably the area of greatest public concern about Big Data, and statisticians cannot afford to ignore it. Data can be anonymized to protect personal information, but there is no such thing as perfect security.
- Reinventing the wheel. Some of the collectors of Big Data—notably, web companies—may not realize that statisticians have generations of experience at getting information out of data, as well as avoiding common fallacies. Some statisticians resent the new term “data science.” Others feel we should accept the reality that “data science” is here and focus on ensuring that it includes training in statistics.
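To make the “look-everywhere effect” concrete, here is a minimal simulation sketch in Python. It is my own illustration, not something from the report: the function names, the sample sizes, and the choice of a Bonferroni correction as the “specific precaution” are all assumptions. Every comparison below is run on pure noise, so any “finding” is spurious by construction.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n=50):
    """Two-sided z-test p-value for two groups drawn from the SAME N(0, 1).

    Hypothetical helper: because both groups share one distribution,
    any "significant" difference it reports is a false positive.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Standard error of the difference of two means with known variance 1.
    z = (fmean(a) - fmean(b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_findings(num_tests=2000, alpha=0.05, seed=0):
    """Count 'significant' results with and without a Bonferroni correction."""
    rng = random.Random(seed)
    p_values = [null_p_value(rng) for _ in range(num_tests)]
    # Naive approach: test everything at the usual threshold.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni (an assumed choice of precaution): divide the threshold
    # by the number of comparisons made.
    corrected = sum(p < alpha / num_tests for p in p_values)
    return naive, corrected


naive, corrected = count_findings()
print(f"Spurious findings at p < 0.05:      {naive}")
print(f"Spurious findings after Bonferroni: {corrected}")
```

With 2,000 comparisons and nothing real to find, the naive threshold should still flag about 5% of them, on the order of a hundred “discoveries,” while the corrected threshold should flag almost none. The trade-off is that a correction this strict also makes real effects harder to detect, which is part of why the problem is far from solved.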
In short, big data isn’t just big. It’s vastly different from the data we typically work with. If you’re feeling inadequate because you are not yet doing much with all that ubiquitous big data that Gartner no longer bothers to track, don’t. Even the statisticians are stymied for now.