VisualBERT: A Simple and Performant Baseline for Vision and Language
Indexed in: arXiv, DataCite
Abstract
We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks, namely VQA, VCR, NLVR2, and Flickr30K, show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and…
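The abstract's central mechanism, self-attention over a single sequence that holds both text tokens and image regions, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `joint_self_attention`, the 3-dimensional embeddings, and the use of identity projections in place of learned query/key/value matrices are all assumptions made for illustration, and the sketch omits multiple heads, stacked layers, position and segment embeddings, and the pre-training objectives.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(text_embs, region_embs):
    """Single-head scaled dot-product self-attention over text + regions.

    Assumption: queries, keys, and values are the raw embeddings
    (identity projections); a real model would learn these projections.
    """
    # Text tokens and image regions share one input sequence.
    seq = text_embs + region_embs
    d = len(seq[0])
    scale = math.sqrt(d)

    # Attention weights: every element attends to every element,
    # so text tokens can attend directly to image regions.
    weights = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in seq]
        weights.append(softmax(scores))

    # Output for each position is the attention-weighted sum of values.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, seq)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


# Hypothetical toy inputs: two text tokens and two image regions.
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
regions = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
weights, outputs = joint_self_attention(text, regions)
```

In this toy input, the first text token places more attention weight on the first region than on the second because their embeddings are more similar, which is the kind of implicit text-to-region alignment the abstract describes.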
Citation impact
1,234 total citations
- FWCI: n/a
- Percentile: n/a
- References: 37
Authors: 5
Topics & keywords
Keywords
- Computer science
- Transformer
- Image (mathematics)
- Language understanding
- Baseline (sea)
- Artificial intelligence
- Natural language processing
- Simple (philosophy)
UN Sustainable Development Goals
- Quality Education