MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Indexed in: arXiv, DataCite
Abstract
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an…
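The alignment idea in the abstract (a frozen visual encoder, a frozen LLM, and one trainable projection layer between them) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names, the toy dimensions (4 and 6), and the random initialisation are all hypothetical, chosen only to show the data flow: visual features are linearly mapped into the LLM's embedding space and prepended to the text embeddings.

```python
import random

# Hypothetical toy sizes; the real encoder and LLM dimensions are not
# given in the abstract above.
D_VIS = 4   # width of one visual feature vector (assumed)
D_LLM = 6   # width of one LLM token embedding (assumed)


def make_projection(d_vis, d_llm, seed=0):
    """Create the single trainable layer: a d_llm x d_vis weight and a bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_vis)]
              for _ in range(d_llm)]
    bias = [0.0] * d_llm
    return weight, bias


def project(visual_tokens, weight, bias):
    """Map each visual feature vector into the LLM embedding space."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Prepend projected visual tokens to the text embeddings.

    The visual encoder and the LLM are treated as frozen: their outputs
    arrive here as fixed inputs, and only weight and bias would be updated
    during training.
    """
    return project(visual_tokens, weight, bias) + text_embeddings
```

Calling `build_llm_input` with two visual tokens of width `D_VIS` and three text embeddings of width `D_LLM` yields a sequence of five vectors, each of width `D_LLM`, which is the shape a language model would consume. In a deep learning framework the same role would typically be played by a single linear layer, with gradients disabled for every other parameter.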
Citation impact
- Total citations: 476
- FWCI: not reported
- Percentile: not reported
- References: 0
Authors: 5
Topics & keywords
Keywords
- Computer science
- Usability
- Repetition (rhetorical device)
- Modal
- Code (set theory)
- Language model
- Encoder
- Artificial intelligence
UN Sustainable Development Goals
- Quality Education