Probabilistic Latent Semantic Indexing
International Computer Science Institute · University of California, Berkeley
Abstract
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as…
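To make the latent class model in the abstract concrete, here is a minimal sketch of fitting it to a small document-term count matrix. It makes several assumptions that are not in the abstract: it uses plain EM rather than the generalized variant the abstract mentions, it uses the parameterization P(w|d) = Σ_z P(w|z) P(z|d), and the function name `fit_plsa`, its parameters, and the toy counts are all hypothetical.

```python
import math
import random


def fit_plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|d) to a document-term count matrix with plain EM.

    Illustrative only: this is standard EM, not the generalized EM
    procedure referred to in the abstract.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        if total <= 0:  # empty document or dead topic: fall back to uniform
            return [1.0 / len(row)] * len(row)
        return [x / total for x in row]

    # Random positive initialization of both conditional distributions.
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    log_liks = []

    for _ in range(n_iter):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                p_w_d = sum(joint)
                log_lik += n * math.log(p_w_d)
                for z in range(n_topics):
                    resp = n * joint[z] / p_w_d
                    new_w_z[z][w] += resp
                    new_z_d[d][z] += resp
        # M-step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        # Log-likelihood under the parameters used in this E-step.
        log_liks.append(log_lik)
    return p_w_z, p_z_d, log_liks


# Toy corpus: 4 documents over a 4-word vocabulary (made-up counts).
counts = [
    [4, 3, 0, 0],
    [3, 5, 0, 1],
    [0, 0, 6, 2],
    [0, 1, 3, 4],
]
p_w_z, p_z_d, log_liks = fit_plsa(counts, n_topics=2)
```

Each row of `p_z_d` is a document's mixture over latent classes, which is the low-dimensional representation an index could be built on. Because this is plain EM, the recorded log-likelihood should not decrease from one iteration to the next.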
Citation impact
- FWCI: 449.57
- Percentile: 100%
- References: 20
Topics & keywords
- Computer science
- Probabilistic latent semantic analysis
- Probabilistic logic
- Search engine indexing
- Artificial intelligence
- Generalization
- Mathematics