An Empirical Study of Training End-to-End Vision-and-Language Transformers
University of California, Los Angeles · Microsoft (Finland)
Abstract
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present Meter, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin Transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs.…
Citation impact
- FWCI: 17.83
- Percentile: 100%
- References: 93
Authors
- 12
Topics & keywords
- Transformer
- Encoder
- Computer science
- End-to-end principle
- Language model
- Artificial intelligence
- Computer engineering
- Voltage