articleJun 1, 2016Closed access

Jointly Modeling Embedding and Translation to Bridge Video and Language

University of Science and Technology of China · Microsoft (United States) · +1 more institution

Indexed incrossref

Abstract

Automatically describing video content with natural language is a fundamental challenge of computer vision. Re-current Neural Networks (RNNs), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can…

Citation impact

597
total citations
FWCI
48.32
Percentile
100%
References
87
Citations per year

Authors

5

Topics & keywords

Keywords
  • Computer science
  • Natural language processing
  • Artificial intelligence
  • Sentence
  • Recurrent neural network
  • Semantics (computer science)
  • Embedding
  • Word embedding
UN Sustainable Development Goals
  • Quality Education
No related works found for this paper.