Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Fu, Chaoyou; Dai, Yuhan; Luo, Yongdong; Li, Lei; Ren, Shuhuai; Zhang, Renrui; Wang, Zihan; Zhou, Chenyu; Shen, Yunhang; Zhang, Mengdan; Chen, Peixian; Li, Yanwei; Lin, Shaohui; Zhao, Sirui; Li, Ke; Xu, Tong; Zheng, Xiawu; Chen, Enhong; Shan, Caifeng; He, Ran; Sun, Xing

doi:10.1109/cvpr52734.2025.02245

Article · Jun 10, 2025 · Closed access

Nanjing University · University of Hong Kong · +1 more institution

Indexed in Crossref

Abstract

In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs to process sequential visual data is still insufficiently explored, highlighting the lack of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to…

Citation impact

Total citations: 67

FWCI: 129.10
Percentile: 100%
References: 0

Authors: 21

Topics & keywords

Keywords

Benchmark (surveying)
Computer science
Modal
Artificial intelligence
Geography
Cartography

Funding

National Natural Science Foundation of China