Improving Text Classification by Shrinkage in a Hierarchy of Classes
Abstract
When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. This paper shows that the accuracy of a naive Bayes text classifier can be significantly improved by taking advantage of a hierarchy of classes. We adopt an established statistical technique called shrinkage that smooths parameter estimates of a data-sparse child with its parent in order to obtain more robust parameter estimates. The approach is also employed in deleted interpolation, a technique for smoothing n-grams in language modeling for speech recognition. Our method scales well to large data sets, with numerous categories in large…
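As a rough illustration of the shrinkage idea described in the abstract, the sketch below mixes a data-sparse child class's word-probability estimates with those of its parent. It is a minimal two-level example, not the paper's implementation: the function names (`word_probs`, `shrink`), the toy counts, and the hand-picked `child_weight` are all assumptions made for illustration, and the abstract does not say how the mixing weights are chosen.

```python
from collections import Counter


def word_probs(counts, vocab):
    """Maximum-likelihood word probabilities over a fixed vocabulary."""
    total = sum(counts[w] for w in vocab)
    if total == 0:
        return {w: 0.0 for w in vocab}
    return {w: counts[w] / total for w in vocab}


def shrink(child_counts, parent_counts, vocab, child_weight=0.5):
    """Linearly interpolate child estimates toward the parent's.

    child_weight is a hypothetical, hand-set mixing weight in [0, 1];
    the remaining mass (1 - child_weight) goes to the parent estimate.
    """
    child = word_probs(child_counts, vocab)
    parent = word_probs(parent_counts, vocab)
    return {
        w: child_weight * child[w] + (1.0 - child_weight) * parent[w]
        for w in vocab
    }


# Toy data (assumed): the child class has never seen the word "wheel",
# but its parent, which pools more documents, has.
vocab = ["engine", "patent", "wheel"]
child_counts = Counter({"engine": 2, "patent": 1})
parent_counts = Counter({"engine": 4, "patent": 4, "wheel": 2})

shrunk = shrink(child_counts, parent_counts, vocab, child_weight=0.5)
for word in vocab:
    print(f"{word}: {shrunk[word]:.4f}")
```

In this toy case the child's unsmoothed estimate for "wheel" is zero, which would veto any document containing that word under naive Bayes; after shrinkage the estimate is pulled toward the parent's and becomes nonzero, while the probabilities still sum to one.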
Authors
- Andrew McCallum (corresponding author)
- Ronald Rosenfeld, Carnegie Mellon University
- Thomas Mitchell, Carnegie Mellon University
- Andrew Y. Ng
Topics & keywords
- Shrinkage
- Computer science
- Hierarchy
- Artificial intelligence
- Natural language processing
- Information retrieval
- Machine learning
- Quality Education