Preprint · arXiv (Cornell University) · Jan 28, 2022 · Green OA

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Indexed in: arXiv, DataCite

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve…

Citation impact

865 total citations
References
0
Citations per year

Authors

4

Topics & keywords

Keywords
  • Closed captioning
  • Bootstrapping (finance)
  • Computer science
  • Language model
  • Generalization
  • Artificial intelligence
  • Code (set theory)
  • Natural language processing