Improving Text Classification by Shrinkage in a Hierarchy of Classes

McCallum, Andrew; Rosenfeld, Ronald; Mitchell, Thomas; Ng, Andrew Y.

doi:10.1184/r1/21708647

Article · Jan 1, 2022 · Green OA

Carnegie Mellon University

Indexed in DataCite

Abstract

When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. This paper shows that the accuracy of a naive Bayes text classifier can be significantly improved by taking advantage of a hierarchy of classes. We adopt an established statistical technique called shrinkage that smooths parameter estimates of a data-sparse child with its parent in order to obtain more robust parameter estimates. The approach is also employed in deleted interpolation, a technique for smoothing n-grams in language modeling for speech recognition. Our method scales well to large data sets, with numerous categories in large…
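As a rough illustration of the shrinkage idea described in the abstract, the sketch below interpolates a data-sparse child class's word estimates with those of its parent. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy counts, the extra uniform component, and the hand-picked mixture weights are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def ml_estimate(counts, vocab):
    """Maximum-likelihood word probabilities from raw counts."""
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}

def shrink(child, parent, vocab, weights=(0.6, 0.3, 0.1)):
    """Linearly interpolate child, parent, and uniform estimates.

    The weights are fixed by hand here (an assumption for illustration);
    they must sum to 1 so the result remains a probability distribution.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    lam_child, lam_parent, lam_uniform = weights
    uniform = 1.0 / len(vocab)
    return {
        w: lam_child * child[w] + lam_parent * parent[w] + lam_uniform * uniform
        for w in vocab
    }

# Toy data (hypothetical): a child class with only four observed tokens,
# and a parent class pooling far more data.
vocab = ["claim", "engine", "patent", "valve"]
child_counts = Counter({"engine": 3, "patent": 1})
parent_counts = Counter({"claim": 10, "engine": 12, "patent": 20, "valve": 8})

child_est = ml_estimate(child_counts, vocab)
parent_est = ml_estimate(parent_counts, vocab)
shrunk = shrink(child_est, parent_est, vocab)

# "valve" never occurs in the child's data, so its unsmoothed estimate is 0;
# after shrinkage it borrows probability mass from the parent.
print(child_est["valve"], shrunk["valve"])
```

The point of the example is only that a word unseen in the sparse child class receives a nonzero, parent-informed probability after interpolation, which is what makes the estimates more robust.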

Citation impact

Total citations: 479

References: 0

Authors

Andrew McCallum (corresponding author)
Ronald Rosenfeld, Carnegie Mellon University
Thomas Mitchell, Carnegie Mellon University
Andrew Y. Ng

Topics & keywords

Topics

Text and Document Classification Technologies (90%)

Keywords

Shrinkage
Computer science
Hierarchy
Artificial intelligence
Natural language processing
Information retrieval
Machine learning

UN Sustainable Development Goals

Quality Education