Article · Jun 1, 2023 · Closed access
Scaling Language-Image Pre-Training via Masking
Indexed in Crossref
Abstract
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling…
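The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `random_mask_patches`, the 50% mask ratio, and the 224x224 image with 16x16 patches are all assumptions chosen for illustration, and plain Python lists stand in for patch embeddings.

```python
import random


def random_mask_patches(patches, mask_ratio=0.5, seed=None):
    """Keep a random subset of patches and drop the rest entirely.

    patches: list of per-patch feature vectors (stand-ins for embeddings).
    mask_ratio: fraction of patches to remove (0.5 is an assumed value).
    Returns (kept_patches, kept_indices), indices in original order.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_patches = len(patches)
    if num_patches == 0:
        return [], []
    # Always keep at least one patch so the encoder has some input.
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    rng = random.Random(seed)
    kept_indices = sorted(rng.sample(range(num_patches), num_keep))
    return [patches[i] for i in kept_indices], kept_indices


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196.
patches = [[float(i)] * 4 for i in range(196)]
kept, idx = random_mask_patches(patches, mask_ratio=0.5, seed=0)
print(len(kept))  # 98
```

Because the masked patches are removed rather than replaced by a placeholder token, the image encoder in this sketch would process only the kept patches, which is the source of the time and memory savings the abstract refers to.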
Citation impact
- Total citations: 217
- FWCI: 25.03
- Percentile: 100%
- References: 89
Authors: 5
Topics & keywords
Keywords
- Computer science
- Masking (illustration)
- Speedup
- Memory footprint
- Scaling
- Training (meteorology)
- Image (mathematics)
- Contrast (vision)
UN Sustainable Development Goals
- Quality Education