A survey on multimodal large language models

Yin, Shukang; Fu, Chaoyou; Zhao, Sirui; Li, Ke; Sun, Xing; Xu, Tong; Chen, Enhong

doi:10.1093/nsr/nwae403

Review · National Science Review · Nov 12, 2024 · Gold OA

University of Science and Technology of China · Suzhou University of Science and Technology · +2 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

Recently, the multimodal large language model (MLLM), represented by GPT-4V, has become a rising research hotspot; it uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we…

Citation impact

Total citations: 525

FWCI: 117.13
Percentile: 100%
References: 189

Authors: 7

Topics & keywords

Keywords

Computer science
Linguistics
Philosophy
