Preprint · arXiv (Cornell University) · May 11, 2023 · Green OA

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Indexed in: arXiv, DataCite

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format.…

Citation impact

404 total citations
References: 0

Authors

9 authors

Topics & keywords

Keywords
  • Computer science
  • Language model
  • Artificial intelligence
  • Transformer
  • Natural language processing
  • Language understanding
  • Variety (cybernetics)
  • Machine learning