Article · Jan 1, 2004 · Gold OA
Unsupervised construction of large paraphrase corpora
Indexed in Crossref
Abstract
We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by…
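To make the first technique concrete, the following is a minimal sketch of pairing sentences by edit distance. It is not the authors' implementation: the abstract does not say whether distance is measured over characters or words, so the word-level tokenization, the `max_distance` threshold, and the function names `edit_distance` and `candidate_pairs` are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of tokens)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def candidate_pairs(sentences, max_distance=3):
    """Return (sentence, sentence, distance) triples for near-identical sentences.

    max_distance is a hypothetical threshold, not a value from the paper.
    Identical sentences (distance 0) are skipped because they carry no
    paraphrase information.
    """
    tokenized = [s.lower().split() for s in sentences]
    pairs = []
    for i in range(len(tokenized)):
        for j in range(i + 1, len(tokenized)):
            d = edit_distance(tokenized[i], tokenized[j])
            if 0 < d <= max_distance:
                pairs.append((sentences[i], sentences[j], d))
    return pairs
```

In this sketch every sentence in a cluster is compared with every other one, which is quadratic in cluster size; that is acceptable for small clusters but would need pruning at corpus scale.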
Citation impact
739 total citations
- FWCI: 19.43
- Percentile: 100%
- References: 15
Authors: 3
Topics & keywords
Topics
Keywords
- Paraphrase
- Computer science
- Natural language processing
- Artificial intelligence
- Sentence
- Set (abstract data type)
- Metric (unit)
- Heuristic
UN Sustainable Development Goals
- Quality Education