Article · Jul 23, 2002 · Closed access

Interactive deduplication using active learning

Indian Institute of Technology Bombay

Indexed in Crossref

Abstract

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can determine whether a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method hinges critically on providing a covering and challenging set of training pairs that brings out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in…
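
The idea of training a duplicate/non-duplicate classifier and asking a person to label only the most informative pairs can be illustrated with a minimal sketch. This is not the paper's algorithm: the similarity measure (token Jaccard averaged with a character-level ratio), the one-dimensional threshold "classifier", and the function names `similarity`, `fit_threshold`, and `most_uncertain` are all assumptions chosen for brevity. It shows only the general pattern of uncertainty-based selection, where the next pair sent for labeling is the one the current model is least sure about.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Average of token-set Jaccard overlap and character-level match ratio."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 1.0
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return (jaccard + ratio) / 2


def fit_threshold(labeled):
    """Toy classifier: midpoint between the lowest-scoring labeled duplicate
    and the highest-scoring labeled non-duplicate.
    `labeled` holds (record_a, record_b, is_duplicate) triples and must
    contain at least one example of each class."""
    dup = [similarity(a, b) for a, b, y in labeled if y]
    non = [similarity(a, b) for a, b, y in labeled if not y]
    return (min(dup) + max(non)) / 2


def most_uncertain(unlabeled, threshold):
    """Pick the unlabeled pair whose score lies closest to the threshold,
    i.e. the pair the toy classifier is least confident about."""
    return min(unlabeled, key=lambda pair: abs(similarity(*pair) - threshold))
```

In an interactive loop under these assumptions, one would call `fit_threshold` on the labeled pairs, show the result of `most_uncertain` to a person, add the answer to the labeled set, and repeat. A real system would likely use several similarity features and a stronger learner than a single threshold.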

Citation impact

Total citations: 693
FWCI: 26.85
Percentile: 100%
References: 43

Authors: 2

Topics & keywords

Keywords
  • Data deduplication
  • Computer science
  • Classifier (UML)
  • Coding (social sciences)
  • Key (lock)
  • Training set
  • Machine learning
  • Artificial intelligence