CoCa: Contrastive Captioners are Image-Text Foundation Models
Indexed in: arXiv, DataCite
Abstract
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which…
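The joint objective described in the abstract can be illustrated with a minimal sketch, assuming a symmetric InfoNCE-style contrastive term over paired image and text embeddings and a mean next-token cross-entropy captioning term. All function names, the temperature of 0.07, the averaging of the two contrastive directions, and the weights `w_con` and `w_cap` are hypothetical choices made for illustration; they are not taken from the abstract and may differ from the paper's actual formulation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is the positive for row k.

    The temperature value and the 0.5 averaging are assumptions.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def captioning_loss(token_logits, target_ids):
    """Mean cross-entropy; token_logits[t] scores the vocabulary at step t."""
    nll = [-log_softmax(step)[tgt] for step, tgt in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)


def joint_loss(image_embs, text_embs, token_logits, target_ids,
               w_con=1.0, w_cap=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))


# Toy batch: two image/text pairs and a two-token caption over a 3-word vocabulary.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
target_ids = [0, 1]
loss = joint_loss(image_embs, text_embs, token_logits, target_ids)
```

In this toy batch the matched pairs sit on the diagonal of the similarity matrix, so the contrastive term is small; swapping the order of `text_embs` would increase it. The sketch covers only the loss combination and does not model the decoder split between unimodal and cross-attending layers.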
Citation impact
- Total citations: 516
- FWCI: —
- Percentile: —
- References: 0
Authors: 6
Topics & keywords
Topics
Keywords
- Computer science
- Closed captioning
- Artificial intelligence
- Encoder
- Deep learning
- Transformer
- Speech recognition
- Language model
UN Sustainable Development Goals
- Quality Education