Article | Goldsmiths (University of London) | Dec 1, 2004 | Green OA

RCV1: A New Benchmark Collection for Text Categorization Research

Abstract

Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used…

Citation impact

Total citations: 2,604
FWCI: 90.21
Percentile: 100%
References: 35

Authors

4

Topics & keywords

Keywords
  • Categorization
  • Computer science
  • Benchmark (surveying)
  • Information retrieval
  • Documentation
  • Coding (social sciences)
  • Text categorization
  • Data collection