How Many Bots Took Your Survey?
Almost certainly more than you think. If you are purchasing access to survey respondents from panel providers, survey software providers, or any other source beyond your own list (and even then, how carefully are you controlling access to your survey?), you are probably getting fraudulent data from automated bots or from survey-taker farms.
We experienced this problem just last week, and here is what we learned:
- It is not harmless noise. These bots or respondent farms are generating non-random data that will skew your statistical estimates and mess up your research.
- No source is immune. For our most recent survey, we sourced sample from the top, most expensive provider in the U.S. market, with all the usual assurances of double opt-in, identity verification, etc. We found fraud. (The provider was horrified, as they should be.)
- The usual ways of identifying fraudulent entry into a survey may not work. Fingerprinting, de-duping on entry links, and cross-checking time stamps all failed to flag these cases as bad. We found the problem only through our own manual review of the data.
In our case, quite by accident, we found a cluster of respondents who were answering in the same or similar (and very unlikely) ways. For example, we asked about a monthly expenditure in an open-end numeric box. Most of these respondents entered 2700, a strangely high number, but possible. Then we looked further down the page of our survey and noticed that the next item we wrote used this exact number in its question text. Our guess? A bot (or a trained survey taker) scanned down the page for the first number it could find and entered that into the box.
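For researchers who clean their data in code, here is a minimal sketch of the kind of check that would surface a cluster like this one, written with pandas. The column names (respondent_id, monthly_spend) and the cluster-size threshold are hypothetical illustrations, not our actual protocol.

```python
import pandas as pd

def flag_duplicate_open_ends(df: pd.DataFrame, column: str, min_cluster: int = 4) -> pd.DataFrame:
    """Return rows whose open-end numeric answer is shared by an
    improbably large group of respondents."""
    counts = df[column].value_counts()
    suspicious_values = counts[counts >= min_cluster].index
    return df[df[column].isin(suspicious_values)]

# Example with made-up data: several respondents all entering 2700.
responses = pd.DataFrame({
    "respondent_id": range(1, 9),
    "monthly_spend": [120, 2700, 2700, 85, 2700, 2700, 300, 2700],
})
print(flag_duplicate_open_ends(responses, "monthly_spend"))
```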
Here is what you should do to find and destroy such data in your survey:
- Implement rigorous data quality checks: Tag between-item data inconsistencies, flag speeders and straight-liners, note all suspicious open-ends, and so on. Before we even noticed the fraud, our data removal protocols had flagged most of these bad apples on quality grounds alone. (A minimal sketch of two such flags follows this list.)
- Use an IP look-up tool to identify suspicious sources of respondents. After the quality checks flagged suspicious cases, we noticed that many of those respondents originated from a small number of ISPs in rural locations with weird names like HugeData Network LLC.
- Blacklist these ISPs and purge their respondents from your data, as sketched in the second code block after this list. We found a handful of other respondents who passed our quality checks but originated from the same ISPs. On careful review, we saw similarities in their data and responses. We got rid of them all.
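Here is a hedged sketch of two of the quality flags mentioned above, speeders and straight-liners. The duration_seconds column, the grid-item names, and the 40-percent-of-median speed cutoff are assumptions for illustration, not our actual thresholds.

```python
import pandas as pd

def flag_speeders(df: pd.DataFrame, duration_col: str = "duration_seconds",
                  floor_fraction: float = 0.4) -> pd.Series:
    """Flag respondents who finished in less than a fraction of the median time."""
    floor = df[duration_col].median() * floor_fraction
    return df[duration_col] < floor

def flag_straight_liners(df: pd.DataFrame, grid_cols: list[str]) -> pd.Series:
    """Flag respondents who gave the identical answer to every item in a grid."""
    return df[grid_cols].nunique(axis=1) == 1

# Example usage with made-up data:
survey = pd.DataFrame({
    "duration_seconds": [600, 95, 540, 720, 80],
    "q1": [3, 5, 2, 4, 5],
    "q2": [4, 5, 1, 4, 5],
    "q3": [2, 5, 3, 5, 5],
})
survey["speeder"] = flag_speeders(survey)
survey["straight_liner"] = flag_straight_liners(survey, ["q1", "q2", "q3"])
print(survey[["speeder", "straight_liner"]])
```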
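And here is one way the blacklist step could look, assuming you have already run each respondent's IP address through your preferred look-up tool and stored the result in a hypothetical isp column. The blacklist entry shown is just the example name from above.

```python
import pandas as pd

# Example entry from the post; extend with whatever your look-up flags.
BLACKLISTED_ISPS = {"hugedata network llc"}

def drop_blacklisted_isps(df: pd.DataFrame, isp_col: str = "isp") -> pd.DataFrame:
    """Remove every respondent whose ISP is on the blacklist, including
    those who otherwise passed the quality checks."""
    is_blacklisted = df[isp_col].str.strip().str.lower().isin(BLACKLISTED_ISPS)
    return df[~is_blacklisted]

# Example usage with made-up data:
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "isp": ["Comcast Cable", "HugeData Network LLC", "Verizon Fios"],
})
print(drop_blacklisted_isps(survey))
```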
What if you are relying on a research vendor to do this for you? Ask them to document each step and show you the outcome. Yes, they should be ensuring data quality (and yes, they tell you they are; and no, you shouldn’t have to worry about it). But ask for the proof. The sample providers we work with (all top names in the industry) have made it clear to us that Versta Research is an outlier when it comes to addressing these problems aggressively.