Distributed Representations of Sentences and Documents

Le, Quoc V.; Mikolov, Tomáš

doi:10.48550/arxiv.1405.4053

Article; arXiv (Cornell University); May 16, 2014; Green OA

Quoc V. Le, Tomáš Mikolov

Google (United States)

Indexed in: arXiv, DataCite

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the…
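The two bag-of-words weaknesses named in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw word counts, and one-hot word vectors over a three-word vocabulary. The helper names (`bag_of_words`, `one_hot`, `euclidean`) and the example sentences are hypothetical and not taken from the paper, and the sketch does not implement Paragraph Vector itself.

```python
from collections import Counter
from itertools import combinations
import math


def bag_of_words(text):
    """Count word occurrences, discarding word order (illustrative helper)."""
    return Counter(text.lower().split())


def one_hot(word, vocab):
    """Return a one-hot vector for `word` over the ordered list `vocab`."""
    return [1.0 if w == word else 0.0 for w in vocab]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Weakness 1: word order is lost. Two sentences with opposite meanings
# produce identical bag-of-words counts.
same_bag = bag_of_words("the dog bit the man") == bag_of_words("the man bit the dog")

# Weakness 2: word semantics are ignored. Under a one-hot encoding, every
# pair of distinct words is the same distance apart, so "powerful" is no
# closer to "strong" than it is to "Paris".
vocab = ["powerful", "strong", "Paris"]
distances = {
    (w1, w2): euclidean(one_hot(w1, vocab), one_hot(w2, vocab))
    for w1, w2 in combinations(vocab, 2)
}

print(same_bag)
for pair, dist in distances.items():
    print(pair, round(dist, 4))
```

Under these assumptions, all three pairwise distances equal the square root of 2. That is the sense in which the three words are "equally distant", and it is the gap a learned dense representation is meant to close.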

Citation impact

5,119

total citations

References: 41

Authors

2

Topics & keywords

Keywords

Paragraph
Computer science
Artificial intelligence
Natural language processing
Feature (linguistics)
Bag-of-words model
Semantics (computer science)
Popularity

UN Sustainable Development Goals

Quality Education