Article · Jan 1, 2005 · Closed access

Automatically Constructing a Corpus of Sentential Paraphrases.

Microsoft Research (United Kingdom)

Abstract

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition…

Citation impact

Total citations: 1,116
FWCI: 0.91
Percentile: 100%
References: 28

Authors

2

Topics & keywords

Keywords
  • Paraphrase
  • Natural language processing
  • Computer science
  • Artificial intelligence
  • Sentence
  • Classifier (UML)
  • Textual entailment
  • Identification (biology)
UN Sustainable Development Goals
  • Peace, Justice and strong institutions