Adaptive duplicate detection using learnable string similarity measures

Bilenko, Mikhail; Mooney, Raymond J.

doi:10.1145/956750.956759

Article, Aug 24, 2003, Closed access

Mikhail Bilenko, Raymond J. Mooney

The University of Texas at Austin

Indexed in Crossref

Abstract

Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches rely on generic or manually tuned distance metrics to estimate the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ a learnable text distance function for each database field, and show that such measures can adapt to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of…
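As a rough illustration of what a per-field text distance function could look like, the sketch below computes a weighted edit distance whose operation costs are parameters, with one parameter set per database field. Everything named here (`EditCosts`, `weighted_edit_distance`, `field_distances`, the field names, and the cost values) is hypothetical and chosen for illustration. The costs are set by hand rather than learned, and the sketch does not reproduce the measures proposed in the paper.

```python
from dataclasses import dataclass


@dataclass
class EditCosts:
    """Hypothetical per-field parameters. A learner would fit these from
    labeled duplicate / non-duplicate pairs; here they are set by hand."""
    substitute: float = 1.0
    insert: float = 1.0
    delete: float = 1.0


def weighted_edit_distance(a: str, b: str, costs: EditCosts) -> float:
    """Edit distance from `a` to `b` with parameterized operation costs."""
    # prev[j] holds the cost of turning the current prefix of `a` into b[:j].
    prev = [j * costs.insert for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * costs.delete]
        for j in range(1, len(b) + 1):
            match_or_sub = prev[j - 1] + (
                0.0 if a[i - 1] == b[j - 1] else costs.substitute
            )
            curr.append(
                min(
                    match_or_sub,                 # match or substitute
                    prev[j] + costs.delete,       # drop a[i-1]
                    curr[j - 1] + costs.insert,   # add b[j-1]
                )
            )
        prev = curr
    return prev[-1]


def field_distances(rec_a: dict, rec_b: dict, costs_by_field: dict) -> dict:
    """One distance per field, each computed with that field's own costs."""
    return {
        field: weighted_edit_distance(rec_a[field], rec_b[field], costs)
        for field, costs in costs_by_field.items()
    }


# Illustrative records and costs (assumed values, not taken from the paper).
costs_by_field = {
    "name": EditCosts(),
    "address": EditCosts(insert=0.5, delete=0.5),
}
rec_a = {"name": "Bob", "address": "5 Main St"}
rec_b = {"name": "Rob", "address": "5 Main St."}
distances = field_distances(rec_a, rec_b, costs_by_field)
```

The resulting per-field distances could then be passed to any classifier that decides whether two records are duplicates; which classifier to use is left open here.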

Citation impact

Total citations: 929
FWCI: 51.75
Percentile: 100%
References: 37
Authors: 2

Topics & keywords

Keywords

Edit distance
Similarity (geometry)
Computer science
String (physics)
Support vector machine
Similarity measure
Artificial intelligence
Range (aeronautics)
