Article · Jan 1, 2006 · Closed access

Topic modeling

University of Cambridge

Indexed in Crossref

Abstract

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language…
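The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea: predicting a word from both the previous word (bigram context) and a topic, with Dirichlet-style smoothing. The function names, the uniform base measure, the single concentration parameter `beta`, and the hand-assigned topic labels are all assumptions made for illustration. In the model described above the topics are latent and the hyperparameters are inferred with Gibbs EM, neither of which is attempted here.

```python
from collections import defaultdict


def bigram_topic_counts(tokens, topics):
    """Count how often each word follows each (previous word, topic) context.

    topics[t] is the (here hand-assigned, hypothetical) topic of tokens[t].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for prev, word, topic in zip(tokens, tokens[1:], topics[1:]):
        counts[(prev, topic)][word] += 1
    return counts


def predictive_prob(word, prev, topic, counts, vocab, beta=1.0):
    """Smoothed estimate of P(word | prev, topic).

    Assumes a uniform base measure over the vocabulary, so each word
    receives beta / len(vocab) pseudo-counts in every context.
    """
    context = counts.get((prev, topic), {})
    total = sum(context.values())
    return (context.get(word, 0) + beta / len(vocab)) / (total + beta)


# Toy data: topic labels are fixed by hand purely for illustration.
tokens = ["new", "york", "new", "york", "new", "model"]
topics = [0, 0, 0, 0, 0, 1]
vocab = sorted(set(tokens))
counts = bigram_topic_counts(tokens, topics)

p_topic0 = predictive_prob("york", "new", 0, counts, vocab)
p_topic1 = predictive_prob("york", "new", 1, counts, vocab)
```

In this toy example the same previous word, "new", yields a different prediction for "york" depending on the topic, which is the kind of interaction a pure bag-of-words topic model or a pure bigram model cannot express on its own.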

Citation impact

Total citations: 1,065
FWCI: 20.51
Percentile: 100%
References: 8
Authors: 1

Topics & keywords

Keywords
  • Bigram
  • Language model
  • Computer science
  • Artificial intelligence
  • Latent Dirichlet allocation
  • Topic model
  • n-gram
  • Probabilistic latent semantic analysis
UN Sustainable Development Goals
  • Quality Education