Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Yu, Jun; Li, Jing; Zhou, Yu; Huang, Qingming

doi:10.1109/tcsvt.2019.2947482

Article · IEEE Transactions on Circuits and Systems for Video Technology · Oct 15, 2019 · Closed access

Hangzhou Dianzi University · University of Chinese Academy of Sciences

Indexed in Crossref

Abstract

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features using an attention mechanism. Despite the success of existing studies, current methods model only the co-attention that characterizes inter-modal interactions, while neglecting the self-attention that characterizes intra-modal interactions. Inspired by the success…
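The distinction the abstract draws between co-attention (inter-modal) and self-attention (intra-modal) can be illustrated with a small sketch. It uses scaled dot-product attention, the standard Transformer formulation, and is not the paper's model: the toy feature vectors and the names `attention`, `regions`, and `words` are assumptions made for illustration, and learned projections, multiple heads, and the CNN/RNN components are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists.

    Each query is compared with every key; the resulting weights
    form a convex combination of the value vectors.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy features: 3 image regions and 2 caption words, 2-D each.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]

# Self-attention (intra-modal): regions attend to other regions.
self_attended = attention(regions, regions, regions)

# Co-attention (inter-modal): words query the image regions.
co_attended = attention(words, regions, regions)
```

In the self-attention call, queries, keys, and values all come from the same modality; in the co-attention call, the queries come from the caption side while keys and values come from the image side. The output always has one vector per query.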

Citation impact

Total citations: 454

FWCI: 22.76
Percentile: 100%
References: 80

Authors

4

Jun Yu (Corresponding)
Hangzhou Dianzi University
Jing Li
Hangzhou Dianzi University
Yu Zhou
Hangzhou Dianzi University
Qingming Huang
University of Chinese Academy of Sciences

Topics & keywords

Keywords

Closed captioning
Computer science
Artificial intelligence
Transformer
Encoder
Vocabulary
Image (mathematics)
Recurrent neural network

UN Sustainable Development Goals

Quality Education

Funding

National Natural Science Foundation of China
Awards: 61702143, 61836002