InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Dai, Wenliang; Li, Junnan; Li, Dongxu; Tiong, Anthony Meng Huat; Zhao, Junqi; Wang, Weisheng; Li, Boyang; Fung, Pascale; Hoi, Steven C. H.

doi:10.48550/arxiv.2305.06500

Preprint; arXiv (Cornell University); May 11, 2023; Green OA

Indexed in: arXiv, DataCite

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format.…

Citation impact

Total citations: 404

FWCI: —
Percentile: —
References: 0

Authors: 9

Topics & keywords

Keywords

Computer science
Language model
Artificial intelligence
Transformer
Natural language processing
Language understanding
Variety (cybernetics)
Machine learning
