FLAVA: A Foundational Language And Vision Alignment Model

Singh, Amanpreet; Hu, Ronghang; Goswami, Vedanuj; Couairon, Guillaume; Galuba, Wojciech; Rohrbach, Marcus; Kiela, Douwe

doi:10.1109/cvpr52688.2022.01519

Article
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Jun 1, 2022
Closed access

Meta (Israel)

Indexed in Crossref

Abstract

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a “foundation”, that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks…

Citation impact

Total citations: 487

FWCI: 27.03
Percentile: 100%
References: 169

Authors: 7

Topics & keywords

Keywords

Computer science
Vision science
Artificial intelligence
Natural language processing
Cognitive science
Linguistics
Psychology
Philosophy
