Article · Aug 9, 2003 · Closed access

A comparison of string distance metrics for name-matching tasks

Carnegie Mellon University

Abstract

Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate metrics proposed by several research communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme that combines TFIDF weighting, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.
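
To make the hybrid idea in the abstract concrete, the following is a minimal Python sketch, not the paper's Java toolkit. It implements the standard Jaro and Jaro-Winkler similarities, then uses Jaro-Winkler to let near-identical tokens count as matches inside a TFIDF-weighted comparison. Several details are assumptions made for illustration only: the function names, the whitespace tokenizer, the smoothed IDF formula, and the threshold `theta=0.9`.

```python
import math
from collections import Counter


def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    half_transpositions = sum(a != b for a, b in zip(s_seq, t_seq))
    m = matches
    return (m / len(s) + m / len(t) + (m - half_transpositions / 2) / m) / 3


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def tokenize(name):
    # Assumption: lowercase and split on whitespace.
    return name.lower().split()


def tfidf_weights(tokens, doc_freq, n_docs):
    """Unit-length TFIDF vector; the smoothed IDF is an assumption."""
    tf = Counter(tokens)
    raw = {
        w: math.log(tf[w] + 1) * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w in tf
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else raw


def soft_tfidf(s, t, corpus, theta=0.9):
    """TFIDF cosine in which tokens match if Jaro-Winkler >= theta."""
    docs = [tokenize(d) for d in corpus]
    doc_freq = Counter(w for d in docs for w in set(d))
    vs = tfidf_weights(tokenize(s), doc_freq, len(docs))
    vt = tfidf_weights(tokenize(t), doc_freq, len(docs))
    score = 0.0
    for w, w_weight in vs.items():
        # Closest token on the other side, by Jaro-Winkler.
        best, best_sim = max(
            ((u, jaro_winkler(w, u)) for u in vt),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if best_sim >= theta:
            score += w_weight * vt[best] * best_sim
    return score


if __name__ == "__main__":
    # Hypothetical toy corpus of names, for illustration only.
    names = ["william cohen", "willliam cohen",
             "pradeep ravikumar", "stephen fienberg"]
    print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
    print(soft_tfidf("william cohen", "willliam cohen", names))
    print(soft_tfidf("william cohen", "pradeep ravikumar", names))
```

In this sketch a plain token-based TFIDF comparison would give the misspelled token "willliam" no credit, whereas the soft variant still pairs it with "william" because their Jaro-Winkler similarity clears the threshold. That robustness to small spelling differences is the intuition behind combining the two schemes.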

Citation impact

Total citations: 1,377
FWCI: 54.31
Percentile: 100%
References: 20

Authors: 3

Topics & keywords

Keywords
  • Computer science
  • Edit distance
  • String metric
  • String searching algorithm
  • Weighting
  • String (physics)
  • Approximate string matching
  • Matching (statistics)