BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Li, Junnan; Li, Dongxu; Xiong, Caiming; Hoi, Steven C. H.

doi:10.48550/arxiv.2201.12086

Preprint, arXiv (Cornell University), Jan 28, 2022, Green OA

Indexed in: arXiv, DataCite

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve…
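
The caption bootstrapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `bootstrap_captions`, `captioner`, and `keep` are hypothetical, the models are replaced by toy stand-in functions, and applying the filter to both the web caption and the synthetic caption is an assumption that goes beyond the abstract's wording.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption text)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],   # hypothetical: image -> synthetic caption
    keep: Callable[[str, str], bool],  # hypothetical: True if caption matches image
) -> List[Pair]:
    """Build a cleaned dataset from noisy web image-text pairs."""
    cleaned: List[Pair] = []
    for image_id, web_caption in web_pairs:
        synthetic_caption = captioner(image_id)
        # Assumption: the filter judges both the web and the synthetic caption.
        for caption in (web_caption, synthetic_caption):
            if keep(image_id, caption):
                cleaned.append((image_id, caption))
    return cleaned


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image_id: str) -> str:
    return f"a photo of {image_id}"


def toy_keep(image_id: str, caption: str) -> bool:
    # A caption counts as "matching" if it mentions the image identifier.
    return image_id in caption


web = [("dog", "buy cheap tickets now"), ("cat", "my cat on the sofa")]
result = bootstrap_captions(web, toy_captioner, toy_keep)
print(result)
```

In this toy run the unrelated web caption for "dog" is dropped, while the synthetic caption and the matching web caption are kept. In a real system both callables would be learned models rather than string checks.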

Citation impact

Total citations: 865

FWCI: —
Percentile: —
References: 0

Authors: 4

Topics & keywords

Keywords

Closed captioning
Bootstrapping (finance)
Computer science
Language model
Generalization
Artificial intelligence
Code (set theory)
Natural language processing
