An Empirical Study of Training End-to-End Vision-and-Language Transformers

University of California, Los Angeles · Microsoft (Finland)

Indexed in Crossref

Abstract

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs.…

Citation impact

Total citations: 314
FWCI: 17.83
Percentile: 100%
References: 93

Authors

12

Topics & keywords

Keywords
  • Transformer
  • Encoder
  • Computer science
  • End-to-end principle
  • Language model
  • Artificial intelligence
  • Computer engineering
  • Voltage