Wait! Wait! Don’t Dump That Data!
The first lesson I learned in one of my graduate-level research methods classes was surprising and simple: You need variation in your data if you want to do statistical analysis. In other words, you cannot identify the causes, predictors, or so-called “drivers” of an outcome unless the data contain a variety of outcomes and a variety of values among the predictors as well.
This lesson came to mind recently as we were analyzing multiple customer databases for a company that wants to understand how it can do a better job converting prospects into purchasers. Halfway into the work, we realized that the databases were strangely incomplete. As it turned out, the marketing and sales people regularly scrubbed the databases to keep them clean and easy to work with. They were deleting old prospects who never purchased, which made it easier for them to focus their marketing efforts.
From a data mining standpoint, this was potentially disastrous. Why do some people buy, and others not buy? It would be hard to say without comparing the two groups. Even a simple “profile” of purchasers (on dimensions such as age, gender, zip code, etc.) would not be very useful without knowing whether it was similar to or different from the profile of prospects who did not purchase.
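To make the point concrete, here is a minimal sketch in Python. The records, the age bands, and the `conversion_rates` helper are all made up for illustration; they are not the client's data or our actual analysis. The sketch computes a conversion rate per age band twice: once on the full list of prospects, and once after "scrubbing" everyone who never purchased.

```python
from collections import Counter

# Hypothetical prospect records: (age_band, purchased). Invented for illustration.
prospects = [
    ("18-34", True), ("18-34", False), ("18-34", False), ("18-34", False),
    ("35-54", True), ("35-54", True), ("35-54", False),
    ("55+", True), ("55+", False), ("55+", False), ("55+", False), ("55+", False),
]


def conversion_rates(records):
    """Return the share of prospects in each age band who purchased."""
    totals = Counter(band for band, _ in records)
    buyers = Counter(band for band, bought in records if bought)
    return {band: buyers[band] / totals[band] for band in totals}


# Full database: the outcome varies, so the age bands can be compared.
full = conversion_rates(prospects)

# Scrubbed database: non-purchasers deleted, so the outcome no longer varies.
scrubbed = conversion_rates([r for r in prospects if r[1]])

print("full:", full)
print("scrubbed:", scrubbed)
```

With the full list, the age bands show different conversion rates, which is exactly the kind of comparison a "drivers" analysis depends on. With the scrubbed list, every band converts at 1.0, because the only people left are buyers, and there is nothing left to compare.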
Fortunately we were able to retrieve data from an archive of all data ever entered and deleted, and for one week we felt like trash pickers sorting nuggets of good trash from bad trash in order to find the crucial comparison group we needed.
Statistical analysis is built on counting and comparison. Which is why you need variation. Which is why you should never delete all that seemingly useless data from people who never buy from you.
—Joe Hopper, Ph.D.