Estimating the success of re-identifications in incomplete datasets using generative models

Rocher, Luc; Hendrickx, Julien M.; Montjoye, Yves-Alexandre de

doi:10.1038/s41467-019-10933-3

Article · Nature Communications · Jul 23, 2019 · Gold OA

Luc Rocher · Julien M. Hendrickx · Yves-Alexandre de Montjoye

Imperial College London · UCLouvain

Indexed in: Crossref, DOAJ, PubMed

Abstract

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. Here, we propose a generative copula-based method that can accurately estimate the likelihood that a specific person will be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with a low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset…

Citation impact

Total citations: 789

FWCI: 63.11
Percentile: 100%
References: 58

Authors: 3

Topics & keywords

Keywords

Computer science
Generative grammar
Identification (biology)
Generative model
Data mining
Machine learning
Data sharing
Data set

UN Sustainable Development Goals

Decent work and economic growth
