Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Li, Gen; Duan, Nan; Fang, Yuejian; Gong, Ming; Jiang, Daxin

doi:10.1609/aaai.v34i07.6795

Article · Proceedings of the AAAI Conference on Artificial Intelligence · Apr 3, 2020 · Diamond OA

Peking University · Microsoft Research Asia (China) · +1 more institution

Indexed in Crossref

Abstract

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic contents into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, using three pre-training tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and…
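
The abstract is truncated in this record, so the following is only a rough sketch of how training examples for two of the named tasks might be prepared, not the authors' implementation. The helper names `mask_tokens` and `make_matching_pairs`, the `[MASK]` symbol, and the 15% masking rate are all assumptions made for illustration. MOC is not shown; it would presumably mask detected image regions in a similar way.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the source


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """MLM-style masking sketch.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None elsewhere. The 15% default
    rate is an assumption.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def make_matching_pairs(pairs, rng=None):
    """VLM-style pair construction sketch.

    For each (image_id, caption) pair, emit one positive example
    (label 1) and, when another pair exists, one negative example
    (label 0) that reuses the image with a caption from a different pair.
    """
    rng = rng or random.Random()
    examples = []
    for i, (image_id, caption) in enumerate(pairs):
        examples.append((image_id, caption, 1))
        if len(pairs) > 1:
            j = rng.choice([k for k in range(len(pairs)) if k != i])
            examples.append((image_id, pairs[j][1], 0))
    return examples
```

A model trained on such examples would predict the original token at each masked position and a binary match label for each image-caption pair.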

Citation impact

Total citations: 744

FWCI: 45.24
Percentile: 100%
References: 44

Authors: 5

Topics & keywords

Keywords

Computer science
Modal
Transformer
Encoder
Natural language processing
Artificial intelligence
Commonsense reasoning
Task (project management)

UN Sustainable Development Goals

Quality Education

Funding

National Natural Science Foundation of China
Awards: 61672062, 61232005