Unified Vision-Language Pre-Training for Image Captioning and VQA

University of Michigan · Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the…
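The contrast between the two pre-training objectives can be illustrated with attention masks over a single shared transformer. The sketch below is not taken from the paper: it assumes the context each prediction sees is restricted by a 0/1 self-attention mask, that image regions precede text tokens in the input sequence, and that the function name `build_attention_mask` and its parameters are purely illustrative.

```python
def build_attention_mask(num_regions, num_tokens, mode):
    """Return an n x n 0/1 matrix where mask[i][j] == 1 means
    position i may attend to position j.

    Assumed layout (hypothetical): positions 0..num_regions-1 are image
    regions, the remaining num_tokens positions are text tokens.
    """
    n = num_regions + num_tokens
    if mode == "bidirectional":
        # Every position sees the full image-text sequence.
        return [[1] * n for _ in range(n)]
    if mode == "seq2seq":
        mask = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j < num_regions:
                    # Image regions are visible to every position.
                    mask[i][j] = 1
                elif i >= num_regions and j <= i:
                    # A text token sees itself and earlier text tokens only.
                    mask[i][j] = 1
        return mask
    raise ValueError("mode must be 'bidirectional' or 'seq2seq'")


if __name__ == "__main__":
    # Two image regions followed by three text tokens.
    for row in build_attention_mask(2, 3, "seq2seq"):
        print(row)
```

Under these assumptions the same network weights serve both objectives, and only the mask passed to self-attention changes between them.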

Citation impact

Total citations: 833
FWCI: 47.86
Percentile: 100%
References: 54

Authors

6

Topics & keywords

Keywords
  • Closed captioning
  • Computer science
  • Transformer
  • Question answering
  • Language model
  • Decoding methods
  • Artificial intelligence
  • Natural language processing
UN Sustainable Development Goals
  • Quality Education

Funding