Sequence to Sequence -- Video to Text

Venugopalan, Subhashini; Rohrbach, Marcus; Donahue, Jeffrey; Mooney, Raymond J.; Darrell, Trevor; Saenko, Kate

doi:10.1109/iccv.2015.515

Article · Dec 1, 2015 · Closed access

The University of Texas at Austin · International Computer Science Institute · +2 more institutions

Indexed in Crossref

Abstract

Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able…

Citation impact

Total citations: 1,399

FWCI: 66.25
Percentile: 100%
References: 73

Authors: 6

Topics & keywords

Keywords

Computer science
Sequence (biology)
Exploit
Set (abstract data type)
Artificial intelligence
Sentence
Natural language processing
Recurrent neural network

UN Sustainable Development Goals

Quality Education
