A survey on multimodal large language models
University of Science and Technology of China · Suzhou University of Science and Technology · +2 more institutions
Abstract
Recently, multimodal large language models (MLLMs), represented by GPT-4V, have become a rising research hotspot. They use powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and optical character recognition (OCR)-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we…
Citation impact
- FWCI: 117.13
- Percentile: 100%
- References: 189
Authors: 7
Topics & keywords
- Computer science
- Linguistics
- Philosophy