Academics at Imperial College London and Belgium’s Université Catholique de Louvain say they have developed an algorithm that can identify over 99 percent of US citizens from almost any available data set using just 15 demographic attributes, such as gender, ZIP (or postal) code or marital status.
In a paper published this week in Nature Communications, Luc Rocher, Julien M. Hendrickx and Yves-Alexandre de Montjoye used the findings to challenge the widespread “release and forget” of anonymised medical, behavioral, and socio-demographic data. They have built a microsite and published the codebase to demonstrate the tool.
The release comes amid a huge rise in the availability of public data sets, including anonymised location data based on Wi-Fi and telco data.
The model raises serious questions about the significant global market for purportedly anonymised data and the methods used to anonymise it. (German researchers were able to identify a judge’s porn preferences in an “anonymous” browsing history dataset they bought legally in 2017, de-anonymized, then presented at DefCon.)
De-Anonymized Easily: GDPR Standards Inadequate?
The researchers said in the paper’s abstract: “Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”
Using a model initially trained on the US Census Bureau’s Public Use Microdata Sample (PUMS), implemented in Julia and Python and based on a latent Gaussian copula, they reported a mean absolute error (MAE) of 0.018 on average in estimating population uniqueness, rising to 0.041 when the model was trained on only a 1% sample of the population. (Further model details can be found in the paper, titled “Estimating the success of re-identifications in incomplete datasets using generative models.”)
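To make the two metrics concrete, here is a minimal Python sketch. It is not the researchers' code and does not implement their Gaussian copula model. It only shows what "population uniqueness" (the share of people whose combination of attributes appears exactly once) and MAE measure. The function names, the toy records and the example estimates are all hypothetical.

```python
from collections import Counter


def population_uniqueness(records):
    """Return the fraction of records whose attribute combination occurs exactly once."""
    counts = Counter(records)
    unique = sum(1 for record in records if counts[record] == 1)
    return unique / len(records)


def mean_absolute_error(estimates, truths):
    """Return the average absolute gap between estimated and true values."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


# Toy population of (gender, ZIP code, marital status); the values are made up.
population = [
    ("F", "10001", "married"),
    ("F", "10001", "married"),
    ("M", "10001", "single"),
    ("F", "94105", "single"),
    ("M", "94105", "married"),
]

# Three of the five records are unique on these attributes.
true_uniqueness = population_uniqueness(population)

# Hypothetical model estimates for two populations, compared with the true values.
mae = mean_absolute_error([0.58, 0.80], [0.60, 0.75])
```

Under this reading, an MAE of 0.018 means the model's estimate of the unique share of a population was off by about 1.8 percentage points on average.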
Data protection laws worldwide typically no longer consider anonymous data to be personal data, allowing it to be freely used, shared, and sold. Academic journals are also increasingly requiring authors to make anonymous data available to the research community, the researchers emphasised.
Modern datasets contain a large number of data points per individual. For instance, the data broker Experian sold Alteryx access to a de-identified dataset containing 248 attributes per household for 120 million Americans; Cambridge University researchers shared anonymous Facebook data for three million users collected through the myPersonality app as part of the Cambridge Analytica scandal.
The data market is widely tapped by marketing companies to help personalise advertising, as well as for academic research purposes.
Publishing the source code to reproduce the experiments in a bid to raise awareness of the issue, the researchers noted: “Sampling a dataset does not provide plausible deniability and does not effectively protect people’s privacy.
“We believe that, in general, it is time to move away from de-identification and tighten the rules for what constitutes truly anonymized data. Making sure data can be used statistically, e.g., for medical research, is extremely important but cannot happen at the expense of people’s privacy.”