ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Lu, Jiasen; Batra, Dhruv; Parikh, Devi; Lee, Stefan

doi:10.48550/arxiv.1908.02265

Preprint; arXiv (Cornell University); Aug 6, 2019; Green OA

Indexed in: arXiv; DataCite

Abstract

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe…

Citation impact

Total citations: 1,673

FWCI: —
Percentile: —
References: 30

Authors: 4

Topics & keywords

Keywords

Computer science
Commonsense reasoning
Natural language processing
Artificial intelligence
Question answering
Task (project management)
Transformer
Natural language

UN Sustainable Development Goals

Quality Education
