Five Best Practices for Keeping Your Data “Anonymous”
I don’t know how big the Versta Research blog fan base is, but at the very least, the Quirk’s editors seem to follow us, and sometimes they ask for permission to reprint an article. Such was the case for an article we wrote last month about anonymity in research.
The gist of that article was that fancy new algorithms can do an amazingly accurate job of re-identifying specific individuals in datasets, even when those datasets have been rigorously “anonymized” and stripped of identifying information. If enough information is available, algorithms can piece together various tidbits and correctly identify the specific people from whom the data was derived 99.98% of the time.
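To see why a handful of seemingly harmless attributes can be so revealing, consider a toy linkage attack. The sketch below is entirely hypothetical (fabricated names, records, and field names), and it is far cruder than the algorithms the article describes, but it shows the core mechanism: joining an “anonymized” survey record to a public roster on just three quasi-identifiers can narrow the match to a single person.

```python
# Toy linkage attack: match an "anonymized" survey record to a public
# roster using only quasi-identifiers. All data below is fabricated.

# Public roster (think: a voter file) with names attached.
roster = [
    {"name": "A. Smith", "zip": "60601", "age_band": "35-44", "sex": "F"},
    {"name": "B. Jones", "zip": "60601", "age_band": "25-34", "sex": "M"},
    {"name": "C. Lee",   "zip": "60614", "age_band": "35-44", "sex": "F"},
]

# A survey record stripped of PII but keeping detailed demographics.
survey_row = {"zip": "60601", "age_band": "35-44", "sex": "F",
              "sensitive_answer": "disagree"}

# Join the two sources on the quasi-identifiers they share.
quasi_ids = ("zip", "age_band", "sex")
matches = [p for p in roster
           if all(p[q] == survey_row[q] for q in quasi_ids)]

# If exactly one roster entry fits, the "anonymous" answer now has a name.
if len(matches) == 1:
    print(f"Re-identified: {matches[0]['name']}")
```

Every extra demographic field you collect acts like another join key, shrinking the candidate pool until only one person is left.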
So what’s a researcher to do? Quirk’s asked us. And can they reprint a version of the article, along with some of our recommendations for those in the industry who are struggling with these issues, just like we are?
Sure thing. You can read the full article at Quirk’s if you wish, or continue reading right here for the five recommendations we provided to Quirk’s, formulated from some of our own evolving best practices:
- Stop using the word anonymous. The word anonymous creates a false sense of security that gives people tacit permission to share data without worrying about privacy. But they should worry. Every person in your organization who touches individual-level data must know that even when data is fully stripped of PII, it might be possible to identify who the individuals are. There is no such thing as anonymous research anymore.
- Minimize your demographics. For most research, just seven or eight demographic data points are needed for sampling, screening, weighting, and analysis. It is always tempting to ask for more (“This might be useful later, so let’s ask just in case!”). But consider the risk. The more you ask, the easier it is for an algorithm to pick out individuals. If you do not absolutely need it, keep it out.
- Use blunt measures. This goes against everything you may have learned in a research methods class, and I, too, still resist it. I want details, knowing that I can collapse data into broad groupings later. But for the sake of confidentiality, use the bluntest measure that meets your needs. Do not ask for a person’s age or year of birth – instead ask which age group they belong to. Do not ask for zip code – instead ask which state they live in.
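In practice, blunting a measure is just a matter of collapsing detail before it is ever stored. Ideally the questionnaire asks only for the band, but if a system does receive an exact value, it can be coarsened on the spot. A minimal sketch (the cut points and labels here are hypothetical, not a recommendation for any particular study):

```python
import bisect

# Hypothetical age bands; store only the band, never the exact age.
AGE_CUTS = [18, 25, 35, 45, 55, 65]  # lower bounds of each band
AGE_LABELS = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

def age_band(age: int) -> str:
    """Map an exact age to a coarse band and discard the detail."""
    return AGE_LABELS[bisect.bisect_right(AGE_CUTS, age)]

print(age_band(42))  # 35-44
print(age_band(17))  # <18
```

The same idea applies to geography (zip code collapsed to state) or income (exact dollars collapsed to ranges): the blunt value is written to the dataset, and the precise value is never retained.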
- Treat all data as confidential. Even your anonymous datasets (… which you no longer refer to as anonymous, right?) should be protected and handled with the same precautions as data that contains PII. That means all data should be encrypted and ideally never sent as an e-mail attachment. It means access should be restricted to a limited number of project personnel, all of whom should be required to maintain strong passwords and use two-factor authentication.
- Delete it when you’re done. This one is tough, too. What if sometime down the road the expensive data you just collected is valuable for something else? Digital storage costs almost nothing, so why not keep it all just in case? Again, consider the risk. As your data sits in storage safe and sound, technologies to pry it open and potentially breach the privacy of those who provided it continue to develop. Most of us outside of academic or federally funded population-based research rarely go back to our data if it is more than a couple of years old. Our advice is to play it safe and delete it when you’re done.
The guiding principle behind these best practices is to put into place rigorous security precautions and privacy protections regardless of how sanitized your data seems to be. “Anonymity is not a property of a data set, but is a property of how you use it,” said one of the researchers who created the new algorithm. Only by carefully constraining and documenting how and to whom your data is circulated can you ensure that it remains truly anonymous.
—Joe Hopper, Ph.D.