Describing Videos by Exploiting Temporal Structure
Université de Sherbrooke · Université de Montréal · +1 more institution
Abstract
Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application to video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description model. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of the short temporal dynamics. The 3-D CNN representation is trained on video action recognition…
Citation impact
- FWCI: 61.76
- Percentile: 100%
- References: 72
Authors
- 7
Topics & keywords
- Computer science
- Recurrent neural network
- Artificial intelligence
- Representation (politics)
- Convolutional neural network
- Context (archaeology)
- Motion (physics)
- Natural language