Automatically Constructing a Corpus of Sentential Paraphrases.

Dolan, William B.; Brockett, Chris

Article | Jan 1, 2005 | Closed access

Abstract

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition…

Citation impact

1,116 total citations

FWCI: 0.91
Percentile: 100%
References: 28

Authors: 2

Topics & keywords

Keywords

Paraphrase
Natural language processing
Computer science
Artificial intelligence
Sentence
Classifier (UML)
Textual entailment
Identification (biology)

UN Sustainable Development Goals

Peace, Justice and Strong Institutions
