Pointy-Haired Boss Finds Significant Correlation
Before the days of DATA MINING this was called DATA DREDGING, and for good reason: it was not considered a good thing. Almost any dataset will have nonsense correlations, and if you've turned on your magical statistical testing tool at the 95% confidence level, about five out of every one hundred nonsense correlations will be flagged as statistically significant by chance alone. If you happen to be working with your Google survey data, this qualifies for Google's cute little light bulb icon announcing it as an "insight," which is utterly ridiculous and utterly embarrassing.
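To make the five-in-a-hundred arithmetic concrete, here is a minimal simulation sketch. Nothing in it comes from a real survey tool: the function names, the sample sizes, and the use of the Fisher z-transform as an approximate significance test are all assumptions chosen for illustration. It correlates pairs of pure random noise and counts how many pairs a 5% significance threshold would flag anyway.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (an approximation,
    assumed here for simplicity; a real tool may use an exact t-test)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_tests=1000, n_rows=200, alpha=0.05, seed=0):
    """Correlate n_tests pairs of unrelated noise columns and count how
    many look 'significant' at the given alpha (hypothetical defaults)."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_tests):
        x = [rng.gauss(0, 1) for _ in range(n_rows)]
        y = [rng.gauss(0, 1) for _ in range(n_rows)]  # independent of x
        if approx_p_value(pearson_r(x, y), n_rows) < alpha:
            flagged += 1
    return flagged


hits = count_false_positives()
print(f"{hits} of 1000 nonsense correlations flagged as significant")
```

Every one of these "findings" is noise by construction, yet the count should land somewhere near 5% of the tests. The more pairs of columns you let the tool compare, the more light bulbs you get.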
So what's a data miner to do? The secret to avoiding embarrassing mistakes like this is to approach your data with authentic business questions and testable hypotheses. Uncover the mission-critical questions, appreciate the nice-to-know questions, specify the red-herring questions, set aside the already-answered questions, and understand which ones are look-elsewhere questions. The process we've outlined in The Art of Asking Questions is one approach to getting there.
It is nearly always worth exploring the data you have. But the goal of data mining should rarely be to find patterns first and explain them afterward. There are simply too many patterns to be found, and most of them are nonsense. You need to enter the mine with a purpose.