Unsupervised construction of large paraphrase corpora

Dolan, Bill; Quirk, Chris; Brockett, Chris

doi:10.3115/1220355.1220406

Article, Jan 1, 2004, Gold OA

Bill Dolan, Chris Quirk, Chris Brockett

Microsoft (United States)

Indexed in Crossref

Abstract

We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by…
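To make the first technique and the evaluation metric concrete, here is a minimal Python sketch. It assumes word-level Levenshtein distance over lowercased, whitespace-split tokens and a hypothetical `max_distance` threshold; the abstract does not state the tokenization, threshold, or any additional filters the authors used. The `alignment_error_rate` function follows the definition of AER commonly used in the word-alignment literature (sure and possible gold links); that the paper uses exactly this variant is an assumption. All function names are illustrative.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two token lists (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(sentences, max_distance=3):
    """Pair sentences from one cluster whose distance is in [1, max_distance].

    max_distance is a hypothetical threshold, not a value from the paper.
    Distance 0 is skipped so identical sentences are not paired.
    """
    tokenized = [s.lower().split() for s in sentences]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(tokenized), 2):
        d = word_edit_distance(a, b)
        if 1 <= d <= max_distance:
            pairs.append((sentences[i], sentences[j], d))
    return pairs


def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where P includes S.

    Each argument is a set of (source_index, target_index) links.
    """
    possible = possible | sure
    denom = len(predicted) + len(sure)
    if denom == 0:
        return 0.0  # convention chosen here for the empty case
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / denom
```

For example, `candidate_pairs(["The cat sat", "The dog sat"], max_distance=1)` pairs the two sentences because they differ by a single substituted word. A lower AER indicates that the predicted alignment agrees more closely with the gold links.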

Citation impact

Total citations: 739
FWCI: 19.43
Percentile: 100%
References: 15
Authors: 3

Topics & keywords

Keywords

Paraphrase
Computer science
Natural language processing
Artificial intelligence
Sentence
Set (abstract data type)
Metric (unit)
Heuristic

UN Sustainable Development Goals

Quality Education
