There Is No Anonymous Research Anymore: How Algorithms Can Find You
In our company we nearly always try to keep data collection “anonymous.” This means that for most of our surveys we intentionally do not know the identity of those who participate (though our sample providers do). We rarely ask for any type of identifying information, not even a first name. If for some reason we have identifiers (like when we work with client customer lists), we use keyed IDs and strip all personally identifiable information (PII) out of our primary data for analysis.
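To make that concrete, here is a minimal sketch of what keyed IDs and PII stripping can look like in practice. This is an illustration, not our production pipeline: the field names, the secret key, and the choice of HMAC-SHA256 are all assumptions for the example.

```python
import hmac
import hashlib

# Hypothetical secret key, stored separately from the research data.
SECRET_KEY = b"replace-with-a-key-kept-outside-the-dataset"

# Fields treated as PII in this illustrative record layout.
PII_FIELDS = {"name", "email", "phone"}

def keyed_id(identifier: str) -> str:
    """Derive a stable pseudonymous ID from a raw identifier using HMAC-SHA256."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def pseudonymize(record: dict) -> dict:
    """Drop PII fields and replace the raw identifier with a keyed ID."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    cleaned["respondent_id"] = keyed_id(record["email"])
    return cleaned

raw = {
    "email": "respondent@example.com",
    "name": "Jane Doe",
    "phone": "555-0100",
    "age": 42,
    "region": "Midwest",
    "satisfaction": 4,
}

print(pseudonymize(raw))
# e.g. {'age': 42, 'region': 'Midwest', 'satisfaction': 4, 'respondent_id': '...'}
```

The point of the key is that the same respondent gets the same ID across files (so records can be joined for analysis) while the mapping back to a real person stays with whoever holds the key, not with the analysis dataset.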
But no matter what we do, new research from computer scientists in the UK and Belgium has shown that a lot of “anonymized” data is not so anonymous. If there is enough information in the dataset, new algorithms can piece together the various tidbits and correctly re-identify the specific people from whom the data was derived, in one case 99.98% of the time.
Even Public Use Microdata Samples from the Census (that’s the PUMS Census data we use and analyze all the time) are vulnerable and should no longer be considered anonymous.
Here is how the researchers summarize their findings:
While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
Yikes. We’ve always known this is possible with small subsets of people. That is why reputable research firms will not give “drill down” data cuts of your customers or employees if only a few are in the subgroup you want to explore. But re-identification is now possible for huge groups of people, even when the most rigorous de-identification techniques have been applied.
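The intuition is easy to demonstrate without the researchers’ copula-based model. The toy sketch below (made-up records, hypothetical attribute names) simply counts how many rows become unique as you combine more demographic attributes; it is purely illustrative, not the method from the paper.

```python
from collections import Counter
from itertools import combinations

# Tiny made-up sample; real microdata would have thousands of rows.
records = [
    {"age": 34, "zip": "60614", "sex": "F", "marital": "single",  "children": 0},
    {"age": 34, "zip": "60614", "sex": "F", "marital": "married", "children": 2},
    {"age": 51, "zip": "60640", "sex": "M", "marital": "married", "children": 1},
    {"age": 51, "zip": "60640", "sex": "M", "marital": "married", "children": 3},
    {"age": 29, "zip": "60657", "sex": "F", "marital": "single",  "children": 0},
]

def share_unique(rows, attrs):
    """Fraction of rows whose combination of the given attributes appears only once."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return sum(1 for r in rows if counts[tuple(r[a] for a in attrs)] == 1) / len(rows)

attrs = list(records[0])
for k in range(1, len(attrs) + 1):
    best = max(combinations(attrs, k), key=lambda c: share_unique(records, c))
    print(f"{k} attribute(s) {best}: {share_unique(records, best):.0%} unique")
```

Even in this tiny sample, a handful of ordinary-looking attributes is enough to single most people out; with 15 attributes and a national population, the researchers put that figure at 99.98%.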
One small consolation for us is that in our type of market research, we rarely collect more than 7 or 8 demographic attributes.
A bigger consolation for us is that we think hard and worry a lot about this issue, and we build internal processes to address it. This is key. As the lead author of the research noted in an interview with the New York Times: “We need to move beyond de-identification. Anonymity is not a property of a data set, but is a property of how you use it.”
That means putting into place all the security precautions and privacy protections we can. Only by carefully constraining and documenting how and to whom our “anonymous” data is circulated can we ensure that it remains truly anonymous.
—Joe Hopper, Ph.D.