CoCa: Contrastive Captioners are Image-Text Foundation Models

Yu, Jiahui; Wang, Zirui; Vasudevan, Vijay K.; Yeung, Legg; Seyedhosseini, Mojtaba; Wu, Yonghui

doi:10.48550/arxiv.2205.01917

Preprint · arXiv (Cornell University) · May 4, 2022 · Green OA

Indexed in: arXiv, DataCite

Abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which…
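The design the abstract describes can be illustrated with a minimal sketch: a decoder whose first half of layers skips cross-attention, and a training objective that adds a contrastive loss to a captioning loss. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the equal default weights `w_con` and `w_cap`, the fixed `temperature` of 0.07, the symmetric (image-to-text plus text-to-image) form of the contrastive term, and the handling of an odd layer count. The sketch uses plain Python lists rather than a tensor library.

```python
import math


def decoder_layer_plan(num_layers):
    """Label each decoder layer by whether it attends to the image encoder.

    The first half is text-only (no cross-attention); the remaining layers
    cross-attend to encoder outputs. Giving the extra layer of an odd count
    to the second half is an assumption of this sketch.
    """
    half = num_layers // 2
    return ["unimodal"] * half + ["multimodal"] * (num_layers - half)


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k (image k, text k) is the positive; all other pairs in the batch
    are negatives. The temperature value is a hypothetical default.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same check over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return i2t + t2i


def captioning_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the target caption tokens."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def coca_style_loss(image_embs, text_embs, token_logits, target_ids,
                    w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

For example, `decoder_layer_plan(4)` returns `['unimodal', 'unimodal', 'multimodal', 'multimodal']`. In a real model the text embeddings for the contrastive term would come from the unimodal half of the decoder and the token logits from the multimodal half; here they are simply passed in as arguments.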

Citation impact

516 total citations

References: 0

Authors: 6

Topics & keywords

Keywords

Computer science
Closed captioning
Artificial intelligence
Encoder
Deep learning
Transformer
Speech recognition
Language model

UN Sustainable Development Goals

Quality Education
