Visual Instruction Tuning

Liu, Haotian; Li, Chunyuan; Wu, Qingyang; Lee, Yong Jae

doi:10.48550/arxiv.2304.08485

Preprint, arXiv (Cornell University), Apr 17, 2023, Green OA

Indexed in: arXiv, DataCite

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal…

Citation impact

Total citations: 676

FWCI: —
Percentile: —
References: 0

Authors: 4

Topics & keywords

Keywords

Computer science
Encoder
Code (set theory)
Field (mathematics)
Language model
Artificial intelligence
Natural language processing
Programming language

UN Sustainable Development Goals

Quality Education
