Automatically Constructing a Corpus of Sentential Paraphrases
Microsoft Research (United Kingdom)
Abstract
An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition…
Citation impact
- FWCI: 0.91
- Percentile: 100%
- References: 28
Authors: 2
Topics & keywords
- Paraphrase
- Natural language processing
- Computer science
- Artificial intelligence
- Sentence
- Classifier (UML)
- Textual entailment
- Identification (biology)
- Peace, Justice and strong institutions