Article | Jun 1, 2023 | Closed access

Scaling Language-Image Pre-Training via Masking

Indexed in Crossref

Abstract

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling…
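The masking step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `random_mask_patches`, the `mask_ratio` parameter, and the argsort-over-noise sampling trick are assumptions made for illustration, and the abstract says only that "a large portion" of patches is removed. The sketch shows why masking saves compute: the image encoder would only see the kept patches, so the sequence length shrinks by the mask ratio.

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.5, rng=None):
    """Keep a random subset of patches per image and drop the rest.

    patches: array of shape (batch, num_patches, dim).
    mask_ratio: fraction of patches to remove (illustrative value).
    Returns (kept_patches, keep_idx).
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, num_patches, _ = patches.shape
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))

    # Sorting independent uniform noise gives one random permutation of
    # patch indices per image; the first num_keep entries are kept.
    noise = rng.random((batch, num_patches))
    keep_idx = np.argsort(noise, axis=1)[:, :num_keep]
    keep_idx = np.sort(keep_idx, axis=1)  # preserve original patch order

    # Gather the kept patches; masked patches are removed, not zeroed.
    kept = np.take_along_axis(patches, keep_idx[:, :, None], axis=1)
    return kept, keep_idx


if __name__ == "__main__":
    # Hypothetical input: 2 images, 196 patches each, 8-dim embeddings.
    x = np.random.default_rng(0).normal(size=(2, 196, 8))
    kept, idx = random_mask_patches(x, mask_ratio=0.5)
    print(kept.shape)  # (2, 98, 8)
```

Because masked patches are dropped rather than replaced with a placeholder, the encoder input in this sketch is half as long at `mask_ratio=0.5`, which is the source of the speed and memory savings the abstract refers to.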

Citation impact

Total citations: 217
FWCI: 25.03
Percentile: 100%
References: 89

Authors

5

Topics & keywords

Keywords
  • Computer science
  • Masking (illustration)
  • Speedup
  • Memory footprint
  • Scaling
  • Training (meteorology)
  • Image (mathematics)
  • Contrast (vision)
UN Sustainable Development Goals
  • Quality Education