Estimating the success of re-identifications in incomplete datasets using generative models
Imperial College London · UCLouvain
Abstract
While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. Here we propose a generative copula-based method that can accurately estimate the likelihood that a specific person is correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with a low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset…
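To make "individual uniqueness" concrete, the following is a minimal sketch and not the authors' copula model. It assumes records are tuples of quasi-identifiers, and that the other individuals in a population of size `n` are drawn independently, each matching a given record with probability `p`. The function names, the toy records, and the values of `p` and `n` are all hypothetical.

```python
from collections import Counter


def empirical_uniqueness(records):
    """Fraction of records whose attribute combination appears exactly once."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)


def prob_unique(p, n):
    """Probability that none of the other n - 1 individuals shares a record.

    Assumes independent draws, each matching the record with probability p.
    This is an illustrative simplification, not the paper's estimator.
    """
    return (1.0 - p) ** (n - 1)


# Hypothetical toy sample: (ZIP code, birth year, gender)
sample = [
    ("02139", 1985, "F"),
    ("02139", 1985, "F"),
    ("02139", 1990, "M"),
    ("10001", 1972, "F"),
]

print(empirical_uniqueness(sample))  # 0.5: two of the four records are unique
print(prob_unique(1e-6, 1000))       # close to 1 for a rare record in a small population
```

Uniqueness measured in a sample generally overstates uniqueness in the full population, because a record that is unique in the sample may still have matches outside it. That gap is why a population-level estimate is needed for incomplete datasets.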
Citation impact
- FWCI: 63.11
- Percentile: 100%
- References: 58
Authors: 3
Topics & keywords
- Computer science
- Generative grammar
- Identification (biology)
- Generative model
- Data mining
- Machine learning
- Data sharing
- Data set
- Decent work and economic growth