Visual Instruction Tuning
Indexed in: arXiv, DataCite
Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal…
Citation impact
676 total citations
- FWCI: —
- Percentile: —
- References: 0
Authors
4
Topics & keywords
Topics
Keywords
- Computer science
- Encoder
- Code (set theory)
- Field (mathematics)
- Language model
- Artificial intelligence
- Natural language processing
- Programming language
UN Sustainable Development Goals
- Quality Education