Article · Jan 1, 2010 · Gold OA

Software Framework for Topic Modelling with Large Corpora

Radim Řehůřek, Petr Sojka

Masaryk University

Indexed in DataCite

Abstract

Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.…
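The document-streaming idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `StreamedCorpus`, the function `document_frequencies`, and the whitespace tokenizer are hypothetical names and simplifications chosen for illustration. The sketch assumes a corpus stored as one document per line and a fixed word-to-id vocabulary; the consumer only ever holds one sparse document vector at a time.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# A sparse bag-of-words vector: (term_id, count) pairs for one document.
BowVector = List[Tuple[int, int]]


class StreamedCorpus:
    """Hypothetical helper: yields one sparse vector per document, on demand."""

    def __init__(self, lines: Iterable[str], vocab: Dict[str, int]):
        # `lines` may be any re-iterable source of documents (a list here;
        # an object that re-opens a file on each pass would serve for
        # corpora that do not fit in RAM).
        self.lines = lines
        self.vocab = vocab

    def __iter__(self) -> Iterator[BowVector]:
        for line in self.lines:
            # Naive whitespace tokenization (an assumption, for brevity);
            # tokens outside the vocabulary are skipped.
            counts = Counter(
                self.vocab[tok] for tok in line.lower().split() if tok in self.vocab
            )
            yield sorted(counts.items())


def document_frequencies(corpus: Iterable[BowVector]) -> Dict[int, int]:
    """One pass over the stream: in how many documents does each term occur?"""
    df: Counter = Counter()
    for bow in corpus:  # only the current document is held by this loop
        df.update(term_id for term_id, _ in bow)
    return dict(df)
```

Any algorithm written against the iteration interface, such as the single-pass statistic above, needs working memory proportional to the vocabulary rather than to the number of documents, which is the property the abstract describes as independence from corpus size.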

Citation impact

Total citations: 3,803
FWCI: 34.02
Percentile: 100%
References: 25

Authors

2

Topics & keywords

Keywords
  • Computer science
  • Latent Dirichlet allocation
  • Scalability
  • Inference
  • Implementation
  • Software
  • Artificial intelligence
  • Topic model
UN Sustainable Development Goals
  • Quality Education