Article · Jan 1, 2004 · Gold OA

Unsupervised construction of large paraphrase corpora

Microsoft (United States)

Indexed in Crossref

Abstract

We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily-aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly-extracted test set. On test data extracted by…
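The first technique named in the abstract, string edit distance, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the word-level (rather than character-level) distance, the lowercase whitespace tokenization, the function names `word_edit_distance` and `candidate_pairs`, and the `max_distance` threshold.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two token lists.

    Insertions, deletions, and substitutions each cost 1.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i tokens of a
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def candidate_pairs(cluster_sentences, max_distance=3):
    """Return (sentence, sentence, distance) triples from one cluster.

    Keeps pairs that differ by at least 1 and at most max_distance
    word edits. max_distance is a hypothetical threshold chosen for
    this sketch, not a value reported in the paper.
    """
    tokenized = [s.lower().split() for s in cluster_sentences]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(tokenized), 2):
        d = word_edit_distance(a, b)
        if 0 < d <= max_distance:
            pairs.append((cluster_sentences[i], cluster_sentences[j], d))
    return pairs
```

Restricting comparison to sentences within one topical cluster keeps the number of pairwise comparisons manageable and makes a small edit distance more likely to signal a paraphrase rather than a coincidental overlap.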

Citation impact

Total citations: 739
FWCI: 19.43
Percentile: 100%
References: 15

Authors

3

Topics & keywords

Keywords
  • Paraphrase
  • Computer science
  • Natural language processing
  • Artificial intelligence
  • Sentence
  • Set (abstract data type)
  • Metric (unit)
  • Heuristic
UN Sustainable Development Goals
  • Quality Education