Review · National Science Review · Nov 12, 2024 · Gold OA

A survey on multimodal large language models

University of Science and Technology of China · Suzhou University of Science and Technology · +2 more institutions

Indexed in Crossref, DOAJ, PubMed

Abstract

Recently, the multimodal large language model (MLLM), represented by GPT-4V, has become a rising research hotspot. An MLLM uses a powerful large language model (LLM) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and math reasoning without optical character recognition (OCR), are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we…


Funding