Interactive deduplication using active learning
Indian Institute of Technology Bombay
Abstract
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method hinges critically on providing a covering and challenging set of training pairs that brings out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in…
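The abstract is cut off before it describes how training pairs are chosen, so the following is only a minimal sketch of the general idea: train a classifier on a few labeled record pairs, then ask a person to label the pair the classifier is least sure about. It assumes plain uncertainty sampling with a tiny logistic-regression model, which may differ from the selection strategy in the paper. All function names, record fields, and example records are hypothetical.

```python
import math
from difflib import SequenceMatcher


def pair_features(a, b):
    """Per-field string similarity in [0, 1] for two records (tuples of strings)."""
    return [SequenceMatcher(None, x.lower(), y.lower()).ratio() for x, y in zip(a, b)]


def predict_proba(w, x):
    """Probability that a pair with feature vector x is a duplicate."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]  # w[-1] is the bias
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=500, lr=0.5):
    """Fit logistic regression with simple stochastic gradient updates."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)
            for j, xj in enumerate(xi):
                w[j] += lr * err * xj
            w[-1] += lr * err
    return w


def most_uncertain(w, candidates):
    """Index of the candidate whose predicted probability is closest to 0.5."""
    return min(range(len(candidates)),
               key=lambda i: abs(predict_proba(w, candidates[i]) - 0.5))


# Hypothetical seed labels: (record_a, record_b, is_duplicate).
labeled = [
    (("John Smith", "12 Main St"), ("Jon Smith", "12 Main Street"), 1),
    (("Alice Wong", "5 Oak Ave"), ("Bob Stone", "77 Pine Rd"), 0),
]
# Hypothetical unlabeled pool of candidate pairs.
pool = [
    (("J. Smith", "12 Main St"), ("John Smith", "Main St 12")),
    (("Alice Wong", "5 Oak Ave"), ("Alice Wong", "5 Oak Avenue")),
    (("Carol Diaz", "9 Elm Ct"), ("Bob Stone", "77 Pine Rd")),
]

X = [pair_features(a, b) for a, b, _ in labeled]
y = [label for _, _, label in labeled]
model = train_logistic(X, y)

pool_features = [pair_features(a, b) for a, b in pool]
query = most_uncertain(model, pool_features)
print("Ask the user to label:", pool[query])
```

In an interactive loop, the pair returned by `most_uncertain` would be shown to the user, moved from the pool into the labeled set with the user's answer, and the model retrained before the next query.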
Citation impact
- FWCI: 26.85
- Percentile: 100%
- References: 43
Authors: 2
Topics & keywords
- Data deduplication
- Computer science
- Classifier (UML)
- Coding (social sciences)
- Key (lock)
- Training set
- Machine learning
- Artificial intelligence