A Dirichlet multinomial mixture model-based approach for short text clustering
Abstract
Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem because short texts are sparse, high-dimensional, and large in volume. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbreviated as GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparsity and high dimensionality of short texts, and can obtain the representative words of each cluster. Our extensive…
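To make the idea of collapsed Gibbs sampling for a Dirichlet multinomial mixture concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `gsdmm`, the parameters `K`, `alpha`, `beta`, `n_iters`, and their default values are hypothetical, and the sampling rule is the standard collapsed conditional for this model family rather than a formula quoted from the abstract. Each document belongs to exactly one cluster, and a document is repeatedly reassigned with probability proportional to the cluster's size times how well the cluster's word counts match the document.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Cluster tokenized short texts; return one cluster label per document.

    K is an upper bound on the number of clusters (an assumed setting);
    clusters may end up empty, so the number in use is inferred.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})   # vocabulary size
    D = len(docs)
    m = [0] * K                             # m[k]: documents in cluster k
    n = [0] * K                             # n[k]: total words in cluster k
    nw = [Counter() for _ in range(K)]      # nw[k][w]: count of word w in cluster k
    counts = [Counter(d) for d in docs]

    # Random initial assignment of every document to a cluster.
    z = []
    for d, c in zip(docs, counts):
        k = rng.randrange(K)
        z.append(k)
        m[k] += 1
        n[k] += len(d)
        nw[k].update(c)

    for _ in range(n_iters):
        for i, (d, c) in enumerate(zip(docs, counts)):
            # Remove document i from its current cluster.
            k = z[i]
            m[k] -= 1
            n[k] -= len(d)
            nw[k].subtract(c)

            # Log of the unnormalized conditional for each candidate cluster.
            logp = []
            for k2 in range(K):
                lp = math.log(m[k2] + alpha) - math.log(D - 1 + K * alpha)
                for w, cnt in c.items():
                    for j in range(cnt):
                        lp += math.log(nw[k2][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n[k2] + V * beta + j)
                logp.append(lp)

            # Sample a new cluster; subtracting the max avoids underflow.
            mx = max(logp)
            weights = [math.exp(lp - mx) for lp in logp]
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k
            m[k] += 1
            n[k] += len(d)
            nw[k].update(c)
    return z
```

With a call such as `labels = gsdmm([["apple", "fruit"], ["car", "engine"]], K=4)`, the number of distinct values in `labels` is the number of clusters actually used, which can be smaller than `K`. The most frequent entries of each cluster's word counts would serve as its representative words.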
Citation impact
- FWCI: 32.13
- Percentile: 100%
- References: 35
Authors
- Jianhua Yin (corresponding author)
Tsinghua University
- Jianyong Wang
Tsinghua University
Topics & keywords
- Cluster analysis
- Computer science
- Latent Dirichlet allocation
- Gibbs sampling
- Multinomial distribution
- Dirichlet distribution
- Data mining
- Artificial intelligence
- Quality Education