Jointly Modeling Embedding and Translation to Bridge Video and Language
University of Science and Technology of China · Microsoft (United States) · +1 more institution
Abstract
Automatically describing video content with natural language is a fundamental challenge of computer vision. Re-current Neural Networks (RNNs), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with the given previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can…
Citation impact
- FWCI
- 48.32
- Percentile
- 100%
- References
- 87
Authors
5- YPYingwei PanCorresponding
University of Science and Technology of China
- TMTao Mei
Microsoft (United States), Microsoft Research Asia (China)
- TYTing Yao
Microsoft (United States), Microsoft Research Asia (China)
- HLHouqiang Li
University of Science and Technology of China
- YRYong Rui
Microsoft (United States), Microsoft Research Asia (China)
Topics & keywords
- Computer science
- Natural language processing
- Artificial intelligence
- Sentence
- Recurrent neural network
- Semantics (computer science)
- Embedding
- Word embedding
- Quality Education