Preprint | arXiv (Cornell University) | May 4, 2022 | Green OA

CoCa: Contrastive Captioners are Image-Text Foundation Models

Indexed in: arXiv, DataCite

Abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which…
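The joint objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, the loss weights `w_con` and `w_cap`, and the choice to average the two contrastive directions are all assumptions made for illustration, and the abstract is truncated in this record, so the architecture itself is not modeled. The sketch only shows how a contrastive loss over paired image and text embeddings might be combined with a next-token captioning loss.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; the i-th image and i-th text are the positive pair.

    The temperature and the averaging of both directions are assumptions.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    sims = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row of the similarity matrix is one softmax.
    i2t = -sum(log_softmax(sims[k])[k] for k in range(n)) / n
    # Text-to-image: each column of the similarity matrix is one softmax.
    t2i = -sum(log_softmax([sims[r][k] for r in range(n)])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy; token_logits[t] scores the vocabulary at step t."""
    total = sum(-log_softmax(logits)[tgt]
                for logits, tgt in zip(token_logits, target_ids))
    return total / len(target_ids)

def joint_loss(image_embs, text_embs, token_logits, target_ids,
               w_con=1.0, w_cap=1.0):
    """Weighted sum of both losses; the weights are hypothetical placeholders."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In a real training setup the embeddings and logits would come from the image encoder and the text decoder; here they are plain lists so the arithmetic can be checked by hand.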

Citation impact

516 total citations
References
0

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Closed captioning
  • Artificial intelligence
  • Encoder
  • Deep learning
  • Transformer
  • Speech recognition
  • Language model
UN Sustainable Development Goals
  • Quality Education