Image as a Foreign Language: BEIT Pretraining for Vision and Vision-Language Tasks

Wang, Wenhui; Bao, Hangbo; Li, Dong; Björck, Johan; Peng, Zhiliang; Liu, Qiang; Aggarwal, Kriti; Mohammed, Owais Khan; Singhal, Saksham; Som, Subhojit; Wei, Furu

doi:10.1109/cvpr52729.2023.01838

Article, Jun 1, 2023, Closed access

Microsoft (Finland)

Indexed in Crossref

Abstract

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEIT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked “language” modeling on images (Imglish), texts (English), and image-text pairs (“parallel sentences”) in a unified manner. Experimental results show…
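As a rough illustration of the unified masked modeling idea described in the abstract, the sketch below applies the same mask-and-recover objective to an image token sequence, a text token sequence, and their concatenation. It is not the authors' implementation: the helper `mask_tokens`, the `[MASK]` placeholder, the `img_*` token names, and the 0.5 mask ratio are all hypothetical choices made for the example, and no model is trained here.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol, chosen for this sketch


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens; return (corrupted sequence, targets by position)."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]  # what a model would be asked to predict
        corrupted[pos] = MASK
    return corrupted, targets


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed stand-ins: discrete visual tokens for an image, words for a text.
    image_tokens = [f"img_{i}" for i in range(8)]
    text_tokens = "a dog runs on the beach".split()
    # An image-text pair is treated as one longer sequence.
    pair_tokens = text_tokens + image_tokens

    for name, seq in [("image", image_tokens), ("text", text_tokens), ("pair", pair_tokens)]:
        corrupted, targets = mask_tokens(seq, mask_ratio=0.5, rng=rng)  # ratio is illustrative
        print(name, corrupted, sorted(targets))
```

The point of the sketch is only that one masking routine serves all three input types; the modality-specific encoding mentioned in the abstract would live in the model that consumes these sequences, which is not shown.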

Citation impact

Total citations: 472

FWCI: 53.41
Percentile: 100%
References: 99

Authors: 11

Topics & keywords

Keywords

Computer science
Artificial intelligence
Computer vision
Image (mathematics)
Foreign language
Machine vision
Natural language processing
Linguistics

UN Sustainable Development Goals

Quality Education