Software Framework for Topic Modelling with Large Corpora
Abstract
Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms: their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e., processing corpora document after document, in a memory-independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.…
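The document-streaming idea from the abstract can be illustrated with a minimal sketch. This is not the framework's actual API: the names `StreamedCorpus` and `document_frequencies` are hypothetical, and the whitespace tokenizer and fixed vocabulary are simplifying assumptions. The point is only that a corpus can be exposed as an iterable that yields one sparse bag-of-words vector at a time, so a consumer never needs the whole corpus in memory.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# A sparse bag-of-words vector: (term_id, count) pairs sorted by term_id.
BowVector = List[Tuple[int, int]]


class StreamedCorpus:
    """Hypothetical corpus that yields one document vector at a time.

    `lines` is any re-iterable source with one document per item
    (assumption: a list here; a wrapper that reopens a file on each
    pass would serve the same role for data stored on disk).
    """

    def __init__(self, lines: Iterable[str], vocabulary: Dict[str, int]) -> None:
        self.lines = lines
        self.vocabulary = vocabulary

    def __iter__(self) -> Iterator[BowVector]:
        for line in self.lines:
            # Only the current document is tokenized and counted.
            counts = Counter(
                self.vocabulary[token]
                for token in line.lower().split()
                if token in self.vocabulary
            )
            yield sorted(counts.items())


def document_frequencies(corpus: Iterable[BowVector]) -> Dict[int, int]:
    """Single pass over the stream: in how many documents does each term occur?"""
    df: Counter = Counter()
    for bow in corpus:
        df.update(term_id for term_id, _ in bow)
    return dict(df)


vocabulary = {"topic": 0, "model": 1, "corpus": 2}
documents = ["topic model topic", "corpus model", "unknown words"]
corpus = StreamedCorpus(documents, vocabulary)
frequencies = document_frequencies(corpus)
```

In this sketch the memory used by `document_frequencies` depends on the vocabulary size, not on the number of documents, which is the property the abstract calls independence of the training corpus size.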
Citation impact
- FWCI: 34.02
- Percentile: 100%
- References: 25
Authors
- Radim Řehůřek (corresponding author)
Masaryk University
- Petr Sojka
Masaryk University
Topics & keywords
- Computer science
- Latent Dirichlet allocation
- Scalability
- Inference
- Implementation
- Software
- Artificial intelligence
- Topic model
- Quality Education