A Universal Part-of-Speech Tagset

Petrov, Slav; Das, Dipanjan; McDonald, Ryan

doi:10.63317/3pjyez8kmkhz

Article · May 1, 2012 · Green OA


Google (United States) · Carnegie Mellon University

Indexed in: arXiv · Crossref · DataCite

Abstract

To facilitate future research in unsupervised induction of syntactic structure and to standardize best practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts of speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags.
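As a rough illustration of what a tagset mapping does, the sketch below converts a few Penn Treebank tags to the twelve universal categories. The mapping shown is a small hand-picked subset for illustration, not the authors' released mapping file, and the names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical. Falling back to `X` for unlisted tags is also an assumption of this sketch.

```python
# The twelve universal part-of-speech categories proposed in the paper.
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Illustrative subset of a Penn Treebank -> universal mapping.
# Assumption: this is not the complete mapping distributed by the authors.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
    "RP": "PRT",
    ".": ".",
}


def to_universal(tagged, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Map (word, fine_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the mapping fall back to "X" (a choice made
    for this sketch only).
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged]


sentence = [("The", "DT"), ("dog", "NN"), ("barked", "VBD"),
            ("loudly", "RB"), (".", ".")]
universal = to_universal(sentence)
```

Applying the same function with a different mapping table per treebank is what lets data from several languages share one coarse tag inventory.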

Citation impact

Total citations: 734
FWCI: 95.51
Percentile: 100%
References: 46

Keywords

Treebank
Computer science
Natural language processing
Grammar induction
Artificial intelligence
Parsing
Dependency (UML)
Part of speech
