Article · Dec 1, 2015 · Closed access

Sequence to Sequence -- Video to Text

The University of Texas at Austin · International Computer Science Institute · +2 more institutions

Indexed in Crossref

Abstract

Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able…

Citation impact

  • Total citations: 1,399
  • FWCI: 66.25
  • Percentile: 100%
  • References: 73

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Sequence (biology)
  • Exploit
  • Set (abstract data type)
  • Artificial intelligence
  • Sentence
  • Natural language processing
  • Recurrent neural network
UN Sustainable Development Goals
  • Quality Education

Funding