Sequence to Sequence -- Video to Text
The University of Texas at Austin · International Computer Science Institute · and 2 more institutions
Abstract
Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (a sequence of frames) and output (a sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able…
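The abstract describes an LSTM that first reads a variable-length sequence of frames and then emits a variable-length sequence of words. As a rough illustration only, the sketch below shows that idea in plain Python with a single LSTM shared across both stages: while reading frames the word slot of the input is zero-padded, and while generating words the frame slot is zero-padded. The toy vocabulary, the dimensions, the class names `LSTMCell` and `VideoToText`, the greedy decoding and the random (untrained) weights are all hypothetical choices made for this sketch; it is not the authors' implementation and omits the training on video-sentence pairs described above.

```python
import math
import random

# Hypothetical toy vocabulary; index 0 is the start token and is never emitted.
VOCAB = ["<BOS>", "<EOS>", "a", "man", "is", "cooking"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class LSTMCell:
    """Standard LSTM cell with input (i), forget (f), output (o) gates and candidate (g)."""

    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        self.W = {k: [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden_size)]
                  for k in "ifog"}
        self.b = {k: [0.0] * hidden_size for k in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation: [current input, previous hidden state]
        pre = {k: [s + bb for s, bb in zip(matvec(self.W[k], z), self.b[k])] for k in "ifog"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c


class VideoToText:
    """Hypothetical, untrained toy captioner: one LSTM reads frames, then emits words."""

    def __init__(self, frame_dim, hidden_size, seed=0):
        rng = random.Random(seed)
        self.frame_dim = frame_dim
        self.hidden_size = hidden_size
        # Input is [frame features | one-hot previous word]; the unused part is zero-padded.
        self.cell = LSTMCell(frame_dim + len(VOCAB), hidden_size, rng)
        self.W_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in VOCAB]

    def one_hot(self, idx):
        v = [0.0] * len(VOCAB)
        v[idx] = 1.0
        return v

    def caption(self, frames, max_len=10):
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        word_pad = [0.0] * len(VOCAB)
        # Encoding stage: consume any number of frames; no words are emitted yet.
        for frame in frames:
            h, c = self.cell.step(list(frame) + word_pad, h, c)
        # Decoding stage: frame input is zero-padded; the previous word is fed back in.
        frame_pad = [0.0] * self.frame_dim
        prev = VOCAB.index("<BOS>")
        words = []
        for _ in range(max_len):
            h, c = self.cell.step(frame_pad + self.one_hot(prev), h, c)
            logits = matvec(self.W_out, h)
            # Greedy argmax over every token except <BOS>.
            prev = max(range(1, len(VOCAB)), key=lambda k: logits[k])
            if VOCAB[prev] == "<EOS>":
                break
            words.append(VOCAB[prev])
        return words


if __name__ == "__main__":
    data_rng = random.Random(1)
    model = VideoToText(frame_dim=4, hidden_size=8)
    short_clip = [[data_rng.random() for _ in range(4)] for _ in range(3)]   # 3 frames
    long_clip = [[data_rng.random() for _ in range(4)] for _ in range(12)]   # 12 frames
    print(model.caption(short_clip))
    print(model.caption(long_clip))
```

Because the weights are random, the printed words are arbitrary. The point of the sketch is structural: a 3-frame clip and a 12-frame clip pass through the same recurrence, and each caption ends either at `<EOS>` or at `max_len`, so neither the input nor the output length is fixed in advance.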
Citation impact
- FWCI (field-weighted citation impact): 66.25
- Percentile: 100%
- References: 73
Authors (6)
- Subhashini Venugopalan (corresponding author), The University of Texas at Austin
- Marcus Rohrbach, International Computer Science Institute; University of California, Berkeley
- Jeffrey Donahue, University of California, Berkeley
- Raymond J. Mooney, The University of Texas at Austin
- Trevor Darrell, University of California, Berkeley
Topics & keywords
- Computer science
- Sequence (biology)
- Exploit
- Set (abstract data type)
- Artificial intelligence
- Sentence
- Natural language processing
- Recurrent neural network
- Quality Education