Article · Nature Communications · Jul 23, 2019 · Gold OA

Estimating the success of re-identifications in incomplete datasets using generative models

Imperial College London · UCLouvain

PubMed
Indexed in: Crossref, DOAJ, PubMed

Abstract

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset…
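As a rough illustration of what "likelihood of uniqueness" can mean, the sketch below assumes that every other member of a population of size n independently matches a given record with probability p_x. The chance that nobody else matches is then (1 - p_x)^(n - 1). This is a simplified illustration, not the copula-based model the abstract describes, and the function name and parameters are hypothetical.

```python
import math


def uniqueness_likelihood(p_x: float, n: int) -> float:
    """Probability that no other individual in a population of size n
    shares a record whose match probability is p_x.

    Illustrative assumption: the other n - 1 individuals are drawn
    independently, each matching the record with probability p_x.
    """
    if not 0.0 <= p_x <= 1.0:
        raise ValueError("p_x must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    if p_x == 1.0:
        # Everyone matches, so the record is unique only if nobody else exists.
        return 1.0 if n == 1 else 0.0
    # log1p keeps precision when p_x is very small and n is very large.
    return math.exp((n - 1) * math.log1p(-p_x))


if __name__ == "__main__":
    # Hypothetical numbers: a record with a one-in-a-billion match
    # probability in a population of one million people.
    print(uniqueness_likelihood(1e-9, 1_000_000))
```

In this simplified view, the rarer the record (smaller p_x) or the smaller the population, the closer the likelihood of uniqueness is to 1.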

Citation impact

Total citations: 789
FWCI: 63.11
Percentile: 100%
References: 58

Authors

3

Topics & keywords

Keywords
  • Computer science
  • Generative grammar
  • Identification (biology)
  • Generative model
  • Data mining
  • Machine learning
  • Data sharing
  • Data set
UN Sustainable Development Goals
  • Decent work and economic growth

Funding