Article · May 1, 2012 · Green OA
A Universal Part-of-Speech Tagset
Google (United States) · Carnegie Mellon University
Indexed in: arXiv · Crossref · DataCite
Abstract
To facilitate future research in unsupervised induction of syntactic structure and to standardize best practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags.
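The mapping idea in the abstract can be illustrated with a minimal sketch. It assumes Penn Treebank tags as the source tagset and lists only a small, illustrative excerpt of a fine-to-universal mapping, not the full mapping released with the paper. The names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical, and falling back to `X` for unlisted tags is an assumption made here for convenience.

```python
# The twelve universal part-of-speech categories proposed in the paper.
UNIVERSAL_TAGS = frozenset({
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
})

# Illustrative excerpt only (assumption): a handful of Penn Treebank tags
# and the universal category each one collapses into.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
    "RP": "PRT",
    ".": ".",
    "FW": "X",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Replace each fine-grained tag with its universal category.

    Tags missing from `mapping` are assigned `fallback`; this is a
    convenience of this sketch, not a rule taken from the paper.
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged_tokens]


if __name__ == "__main__":
    sentence = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"),
                ("loudly", "RB"), (".", ".")]
    print(to_universal(sentence))
```

Keeping the mapping as a plain dictionary per source tagset means a new treebank only needs its own table, while the conversion function stays the same.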
Citation impact
734 total citations
- FWCI: 95.51
- Percentile: 100%
- References: 46
Authors: 3
Topics & keywords
Keywords
- Treebank
- Computer science
- Natural language processing
- Grammar induction
- Artificial intelligence
- Parsing
- Dependency (UML)
- Part of speech