Preprint | arXiv (Cornell University) | Apr 20, 2023 | Green OA

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Indexed in: arXiv, DataCite

Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an…
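The alignment idea described in the abstract, a single projection layer between a frozen visual encoder and a frozen LLM, can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the random initialisation, the stand-in visual features, and the helper names `make_projection` and `project` are all hypothetical and chosen only to show the shape of the computation.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create weights and bias for one linear layer.

    In the setup the abstract describes, this layer would be the only
    trainable component; both the encoder and the LLM stay frozen.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(visual_tokens, weights, bias):
    """Map each visual feature vector into the LLM's embedding space."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weights, bias)]
        for token in visual_tokens
    ]

# Hypothetical sizes, far smaller than any real model, for illustration only.
VISUAL_DIM, LLM_DIM, NUM_TOKENS = 4, 6, 3

# Stand-in for the output of a frozen visual encoder (not a real model).
visual_tokens = [[float(i + j) for j in range(VISUAL_DIM)]
                 for i in range(NUM_TOKENS)]

weights, bias = make_projection(VISUAL_DIM, LLM_DIM)
llm_inputs = project(visual_tokens, weights, bias)

# One projected vector per visual token, each of the LLM's embedding width.
print(len(llm_inputs), len(llm_inputs[0]))  # prints: 3 6
```

In a real system the projected vectors would be fed to the language model alongside the text embeddings; that step is omitted here because the record does not describe it.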

Citation impact

Total citations: 476
References
0

Authors

5

Topics & keywords

Keywords
  • Computer science
  • Usability
  • Repetition (rhetorical device)
  • Modal
  • Code (set theory)
  • Language model
  • Encoder
  • Artificial intelligence
UN Sustainable Development Goals
  • Quality Education