Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Hangzhou Dianzi University · University of Chinese Academy of Sciences

Indexed in Crossref

Abstract

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features using an attention mechanism. Despite the success of existing studies, current methods model only the co-attention that characterizes inter-modal interactions, while neglecting the self-attention that characterizes intra-modal interactions. Inspired by the success…
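The distinction the abstract draws between co-attention (inter-modal) and self-attention (intra-modal) can be illustrated with a minimal sketch. This is not the paper's model: it assumes plain single-head scaled dot-product attention with no learned projections, and the function names and toy feature values are hypothetical. The only difference between the two cases is where the queries come from.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: each query is replaced by a
    # weighted average of the values, weighted by query-key similarity.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy features: 3 image regions and 2 caption words, dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]

# Self-attention (intra-modal): regions attend to other regions.
self_attended = attention(regions, regions, regions)

# Co-attention (inter-modal): caption words attend to image regions.
co_attended = attention(words, regions, regions)
```

In both calls the output has one vector per query, so self-attention yields three region vectors and co-attention yields two word vectors, each a weighted mix of region features.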

Citation impact

Total citations: 454
FWCI: 22.76
Percentile: 100%
References: 80

Authors

4

Topics & keywords

Keywords
  • Closed captioning
  • Computer science
  • Artificial intelligence
  • Transformer
  • Encoder
  • Vocabulary
  • Image (mathematics)
  • Recurrent neural network
UN Sustainable Development Goals
  • Quality Education

Funding