July 24, 2019 (updated 25 Jul 2019, 9:39am)

This Algorithm can Identify 99.98% of Americans in “Anonymised” Datasets

"Our results seriously challenge the technical and legal adequacy of the de-identification release-and-forget model"

By CBR Staff Writer

Academics at Imperial College London and Belgium’s Université Catholique de Louvain say they have developed an algorithm that can identify over 99 percent of US citizens from almost any available data set using just 15 demographic attributes, such as gender, ZIP (or postal) code or marital status.

In a paper published this week in Nature Communications, Luc Rocher, Julien M. Hendrickx and Yves-Alexandre de Montjoye used the findings to challenge the widespread “release and forget” approach to sharing anonymised medical, behavioural, and socio-demographic data. They have built a microsite and published the codebase to demonstrate the tool.

The release comes amid a huge rise in the availability of public data sets, including anonymised location data based on Wi-Fi and telco data.

The model raises serious questions about the significant global market for purportedly anonymised data and the methods used to anonymise it. (In 2017, German researchers legally bought an “anonymous” browsing history dataset, de-anonymised it, and were able to identify a judge’s porn preferences; they presented the results at DefCon.)

De-Anonymized Easily: GDPR Standards Inadequate?

The researchers said in the paper’s abstract: “Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”

The model, programmed in Julia and Python, uses a latent Gaussian copula and was initially trained on the Census Bureau’s Public Use Microdata Sample (PUMS). The researchers demonstrated a mean absolute error (MAE) of 0.018 on average in estimating population uniqueness, and an MAE of 0.041 when training on only a 1% population sample. (Further model details can be found in the paper, titled “Estimating the success of re-identifications in incomplete datasets using generative models”.)
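To make the two quantities in that paragraph concrete, here is a minimal Python sketch of what “population uniqueness” and “mean absolute error” measure. It is not the researchers’ copula model: it simply counts how many records in a made-up population are the only ones with a given combination of attributes. The attribute names, value ranges and the `synthetic_population` helper are all assumptions invented for illustration.

```python
from collections import Counter
import random


def uniqueness(records, attrs):
    """Fraction of records whose combination of `attrs` values appears exactly once."""
    keys = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def mean_absolute_error(estimates, truths):
    """Average absolute gap between estimated and true values."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)


def synthetic_population(n, seed=0):
    """Hypothetical toy population; real census data is far richer."""
    rng = random.Random(seed)
    return [
        {
            "gender": rng.choice(["F", "M"]),
            "zip": rng.randrange(100),
            "birth_year": rng.randrange(1940, 2000),
            "marital": rng.choice(["single", "married", "divorced", "widowed"]),
        }
        for _ in range(n)
    ]


if __name__ == "__main__":
    population = synthetic_population(5000)
    # Uniqueness can only stay flat or rise as more attributes are combined.
    for attrs in (
        ["gender"],
        ["gender", "zip"],
        ["gender", "zip", "birth_year", "marital"],
    ):
        print(attrs, round(uniqueness(population, attrs), 3))
```

Adding attributes can never make a record less distinctive, which is why the share of unique individuals climbs as more demographic fields are combined. An MAE such as the 0.018 quoted above would be the average gap between a model’s estimate of this fraction and the true value.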

Data protection laws worldwide typically no longer consider anonymous data to be personal data, allowing it to be freely used, shared, and sold. Academic journals are also increasingly requiring authors to make anonymous data available to the research community, the researchers emphasised.

Modern datasets contain a large number of data points per individual. For instance, the data broker Experian sold Alteryx access to a de-identified dataset containing 248 attributes per household for 120 million Americans; Cambridge University researchers shared anonymous Facebook data for three million users collected through the myPersonality app as part of the Cambridge Analytica scandal.

The data market is widely tapped by marketing companies to help personalise advertising, as well as for academic research purposes.

The researchers have published the source code to reproduce the experiments in a bid to raise awareness of the issue. They noted: “Sampling a dataset does not provide plausible deniability and does not effectively protect people’s privacy.

“We believe that, in general, it is time to move away from de-identification and tighten the rules for what constitutes truly anonymized data. Making sure data can be used statistically, e.g., for medical research, is extremely important but cannot happen at the expense of people’s privacy.”
