Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training
Peking University · Microsoft Research Asia (China) · +1 more institution
Abstract
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic contents into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, employing three pre-training tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and…
Citation impact
- FWCI: 45.24
- Percentile: 100%
- References: 44
Authors: 5
Topics & keywords
- Computer science
- Modal
- Transformer
- Encoder
- Natural language processing
- Artificial intelligence
- Commonsense reasoning
- Task (project management)
- Quality Education