RCV1: A New Benchmark Collection for Text Categorization Research
Abstract
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used…
Citation impact
- Total citations: 2,604
- FWCI: 90.21
- Percentile: 100%
- References: 35
Authors: 4
Topics & keywords
Keywords
- Categorization
- Computer science
- Benchmark (surveying)
- Information retrieval
- Documentation
- Coding (social sciences)
- Text categorization
- Data collection