Unified Vision-Language Pre-Training for Image Captioning and VQA

Zhou, Luowei; Palangi, Hamid; Zhang, Lei; Hu, Houdong; Corso, Jason J.; Gao, Jianfeng

doi:10.1609/aaai.v34i07.7005

Article · Proceedings of the AAAI Conference on Artificial Intelligence · Apr 3, 2020 · Diamond OA

University of Michigan · Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the…
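The abstract is cut off before it states how the two objectives differ, so the following is only a minimal sketch under an explicit assumption: that the bidirectional and seq2seq objectives share one transformer and are distinguished only by which positions each token may attend to. The function name `build_attention_mask`, its parameters, and the layout (image regions first, then text tokens) are hypothetical choices for illustration, not taken from the paper.

```python
def build_attention_mask(num_image, num_text, mode):
    """Build an n x n 0/1 matrix where mask[i][j] == 1 means
    position i is allowed to attend to position j.

    Assumed layout (hypothetical): positions 0..num_image-1 are image
    regions, followed by num_text text tokens.
    """
    if mode not in ("bidirectional", "seq2seq"):
        raise ValueError("mode must be 'bidirectional' or 'seq2seq'")

    n = num_image + num_text
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "bidirectional":
                # Every position sees the full image-text sequence.
                mask[i][j] = 1
            elif j < num_image:
                # seq2seq: image regions are visible to every position.
                mask[i][j] = 1
            elif i >= num_image and j <= i:
                # seq2seq: a text token sees itself and earlier text only.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(2, 3, "seq2seq"):
        print(row)
```

Under this assumption the same network weights serve both objectives, and only the mask passed to self-attention changes between them.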

Citation impact

Total citations: 833

FWCI: 47.86
Percentile: 100%
References: 54

Authors: 6

Topics & keywords

Keywords

Closed captioning
Computer science
Transformer
Question answering
Language model
Decoding methods
Artificial intelligence
Natural language processing

UN Sustainable Development Goals

Quality Education
