Unified Vision-Language Pre-Training for Image Captioning and VQA
University of Michigan · Microsoft Research (United Kingdom)
Abstract
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the…
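As a rough illustration of how one shared transformer can serve both pre-training objectives, the sketch below builds two self-attention masks over a sequence of image regions followed by text tokens. It assumes the objectives are distinguished by which positions each token may attend to: everything in the bidirectional case, and only image regions plus earlier text in the seq2seq case. The function names and the 0/1 list-of-lists layout are hypothetical choices made for this example, not the authors' code, and special tokens are omitted.

```python
def bidirectional_mask(n_img, n_txt):
    """Every position may attend to every other position."""
    n = n_img + n_txt
    return [[1] * n for _ in range(n)]


def seq2seq_mask(n_img, n_txt):
    """Image regions attend only to image regions; each text token
    attends to all image regions, itself, and earlier text tokens."""
    n = n_img + n_txt
    mask = [[0] * n for _ in range(n)]
    for i in range(n):          # i: querying position
        for j in range(n):      # j: position being attended to
            if j < n_img:
                mask[i][j] = 1  # image regions are visible to everyone
            elif i >= n_img and j <= i:
                mask[i][j] = 1  # text sees itself and text to its left
    return mask


if __name__ == "__main__":
    # Two image regions followed by three text tokens.
    for row in seq2seq_mask(2, 3):
        print(row)
```

Under this assumption the network weights are identical for both objectives; only the mask passed to self-attention changes.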
Citation impact
- FWCI: 47.86
- Percentile: 100%
- References: 54
Authors
- 6
Topics & keywords
- Closed captioning
- Computer science
- Transformer
- Question answering
- Language model
- Decoding methods
- Artificial intelligence
- Natural language processing
- Quality Education