MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, Deyao; Chen, Jun; Shen, Xiaoqian; Li, Xiang; Elhoseiny, Mohamed

doi:10.48550/arxiv.2304.10592

Preprint | arXiv (Cornell University) | Apr 20, 2023 | Green OA

Indexed in: arXiv, DataCite

Abstract

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. However, the technical details behind GPT-4 continue to remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the utilization of sophisticated large language models (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using one projection layer. Our work, for the first time, uncovers that properly aligning the visual features with an…
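The alignment idea described in the abstract, a frozen visual encoder and a frozen LLM joined by one projection layer, can be illustrated with a small dependency-free sketch. Everything here is an assumption for illustration: the projection is taken to be a single linear map, `frozen_visual_encoder` is a hypothetical stand-in rather than the paper's encoder, the dimensions are arbitrary, and prepending the projected visual tokens to the text embeddings is one plausible way to feed them to the language model, not a detail confirmed by the truncated abstract.

```python
import random


class LinearProjection:
    """The only trainable component in this sketch: one linear map."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # weight has shape (out_dim, in_dim); bias has shape (out_dim,)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, feature):
        # y = W x + b for a single feature vector x
        return [sum(w * x for w, x in zip(row, feature)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_parameters(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def frozen_visual_encoder(image, num_tokens=4, feat_dim=8):
    """Hypothetical stand-in for a frozen vision encoder (no trainable state)."""
    return [[image[(t + i) % len(image)] for i in range(feat_dim)]
            for t in range(num_tokens)]


def build_llm_inputs(image, text_embeddings, projection):
    """Project visual features into the LLM embedding space, then prepend them."""
    visual_tokens = [projection(f) for f in frozen_visual_encoder(image)]
    return visual_tokens + text_embeddings


VISUAL_DIM, LLM_DIM = 8, 16  # illustrative sizes, not taken from the paper
projection = LinearProjection(VISUAL_DIM, LLM_DIM)
image = [0.1 * i for i in range(10)]                   # fake flattened image
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]  # fake prompt embeddings
llm_inputs = build_llm_inputs(image, text_embeddings, projection)
print(len(llm_inputs), len(llm_inputs[0]), projection.num_parameters())
```

With these toy sizes the script prints `7 16 144`: four projected visual tokens plus three text embeddings, each of width 16, and 144 trainable values in the projection. The point of the design is that only `LinearProjection` carries parameters to update, while the encoder and the language model stay fixed.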

Citation impact

Total citations: 476

FWCI: —
Percentile: —
References: 0

Authors: 5

Topics & keywords

Keywords

Computer science
Usability
Repetition (rhetorical device)
Modal
Code (set theory)
Language model
Encoder
Artificial intelligence

UN Sustainable Development Goals

Quality Education
