An Empirical Study of Training End-to-End Vision-and-Language Transformers

Dou, Zi-Yi; Xu, Yichong; Gan, Zhe; Wang, Jianfeng; Wang, Shuohang; Wang, Lijuan; Zhu, Chenguang; Zhang, Pengchuan; Yuan, Lu; Peng, Nanyun; Liu, Zicheng; Zeng, Michael

doi:10.1109/cvpr52688.2022.01763

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

Zi-Yi Dou · Yichong Xu · Zhe Gan · Jianfeng Wang · Shuohang Wang

University of California, Los Angeles · Microsoft (Finland)

Indexed in Crossref

Abstract

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs.…

Citation impact

Total citations: 314

FWCI: 17.83
Percentile: 100%
References: 93

Authors: 12

Topics & keywords

Keywords

Transformer
Encoder
Computer science
End-to-end principle
Language model
Artificial intelligence
Computer engineering
Voltage
