Preprint · arXiv (Cornell University) · Apr 17, 2023 · Green OA

Visual Instruction Tuning

Indexed in: arXiv, DataCite

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal…

Citation impact

676 total citations
References: 0

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Encoder
  • Code (set theory)
  • Field (mathematics)
  • Language model
  • Artificial intelligence
  • Natural language processing
  • Programming language
UN Sustainable Development Goals
  • Quality Education