BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, Junnan; Li, Dongxu; Savarese, Silvio; Hoi, Steven C. H.

doi:10.48550/arxiv.2301.12597

Preprint, arXiv (Cornell University), Jan 30, 2023, Green OA

Indexed in: arXiv, DataCite

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks,…
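The two-stage schedule described in the abstract can be sketched as a small bookkeeping example. This is an illustrative sketch only, not the authors' implementation: the `Module` class, `build_stage`, and `trainable_names` are hypothetical names, and the sketch assumes the Querying Transformer is the only component whose weights are updated, which the abstract implies but does not state outright.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for a model component (not a real library class)."""
    name: str
    trainable: bool


def build_stage(stage):
    """Return the components involved in a given pre-training stage.

    Stage 1: representation learning against a frozen image encoder.
    Stage 2: generative learning that also involves a frozen language model.
    """
    image_encoder = Module("image_encoder", trainable=False)   # frozen in both stages
    q_former = Module("querying_transformer", trainable=True)  # assumed to be the only updated part
    if stage == 1:
        return [image_encoder, q_former]
    if stage == 2:
        language_model = Module("language_model", trainable=False)  # frozen
        return [image_encoder, q_former, language_model]
    raise ValueError("stage must be 1 or 2")


def trainable_names(modules):
    """List the names of components that would receive gradient updates."""
    return [m.name for m in modules if m.trainable]


if __name__ == "__main__":
    for stage in (1, 2):
        modules = build_stage(stage)
        print(stage, [m.name for m in modules], trainable_names(modules))
```

In both stages the list of trainable components stays the same while the set of frozen components grows, which is the source of the efficiency the abstract claims.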

Citation impact

Total citations: 914

FWCI: —
Percentile: —
References: 0

Authors: 4

Topics & keywords

Keywords

Computer science
Language model
Encoder
Artificial intelligence
Transformer
Natural language processing
Natural language
Image (mathematics)

UN Sustainable Development Goals

Quality Education