Multimodal Transformer With Multi-View Visual Representation for Image Captioning
Hangzhou Dianzi University · University of Chinese Academy of Sciences
Abstract
Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that uses an attention mechanism over the visual features to generate the output caption words. Despite the success of existing studies, current methods model only the co-attention that characterizes inter-modal interactions and neglect the self-attention that characterizes intra-modal interactions. Inspired by the success…
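The difference between co-attention and self-attention described in the abstract can be illustrated with scaled dot-product attention, the basic operation in Transformer models. The following is a minimal NumPy sketch under stated assumptions, not the authors' model: the function names, feature sizes, and random inputs are hypothetical, and learned projections and multiple heads are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Scaled dot-product attention: each query takes a weighted
    # average of the values, weighted by query-key similarity.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Hypothetical inputs: 5 image-region features and 3 word embeddings,
# both of dimension 8 (sizes chosen only for illustration).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
words = rng.normal(size=(3, 8))

# Self-attention (intra-modal): regions attend to other regions.
self_out, self_w = attention(regions, regions, regions)

# Co-attention (inter-modal): words attend to image regions.
co_out, co_w = attention(words, regions, regions)

print(self_w.shape, co_w.shape)
```

In this sketch the only difference between the two cases is where the queries come from: self-attention draws queries, keys, and values from the same modality, while co-attention draws queries from one modality and keys and values from the other.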
Citation impact
- FWCI: 22.76
- Percentile: 100%
- References: 80
Authors
- Jun Yu (corresponding author), Hangzhou Dianzi University
- Jing Li, Hangzhou Dianzi University
- Yu Zhou, Hangzhou Dianzi University
- Qingming Huang, University of Chinese Academy of Sciences
Topics & keywords
- Closed captioning
- Computer science
- Artificial intelligence
- Transformer
- Encoder
- Vocabulary
- Image (mathematics)
- Recurrent neural network
- Quality Education