Scaling Language-Image Pre-Training via Masking

Li, Yanghao; Fan, Haoqi; Hu, Ronghang; Feichtenhofer, Christoph; He, Kaiming

doi:10.1109/cvpr52729.2023.02240

Article; Jun 1, 2023; Closed access

Indexed in Crossref

Abstract

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP [52]. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling…
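The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_mask_patches`, the use of NumPy, the 14x14 patch grid, the embedding size, and the mask ratio of 0.5 are all assumptions chosen for illustration, since this record only says that "a large portion" of patches is removed.

```python
import numpy as np


def random_mask_patches(patches, mask_ratio, rng):
    """Randomly drop a fraction of patches from each image (hypothetical helper).

    patches:    array of shape (batch, num_patches, dim)
    mask_ratio: fraction of patches to remove, 0 <= mask_ratio < 1
    rng:        a numpy.random.Generator

    Returns (kept, keep_idx), where kept has shape
    (batch, num_keep, dim) and keep_idx holds the indices of the
    surviving patches for each image, in ascending order.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    batch, num_patches, _ = patches.shape
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))

    # Sorting uniform noise yields an independent random permutation per image.
    noise = rng.random((batch, num_patches))
    keep_idx = np.sort(np.argsort(noise, axis=1)[:, :num_keep], axis=1)

    # Gather the surviving patches; masked patches are removed entirely,
    # so a downstream encoder would see a shorter sequence.
    kept = np.take_along_axis(patches, keep_idx[:, :, None], axis=1)
    return kept, keep_idx


rng = np.random.default_rng(0)
# 2 images, 14x14 = 196 patches, embedding size 8 (all assumed sizes)
patches = rng.standard_normal((2, 196, 8))
kept, keep_idx = random_mask_patches(patches, mask_ratio=0.5, rng=rng)
print(kept.shape)  # (2, 98, 8)
```

Because the masked patches are removed rather than replaced by a placeholder, the sequence passed to an image encoder is shorter. That is the plausible source of the time and memory savings the abstract mentions, although the exact savings depend on the encoder and are not stated in this record.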

Citation impact

Total citations: 217

FWCI: 25.03
Percentile: 100%
References: 89

Authors: 5

Topics & keywords

Keywords

Computer science
Masking (illustration)
Speedup
Memory footprint
Scaling
Training (meteorology)
Image (mathematics)
Contrast (vision)

UN Sustainable Development Goals

Quality Education
