MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Li, Kunchang; Wang, Yali; He, Yinan; Li, Yizhuo; Wang, Yi; Liu, Yi; Wang, Zun; Xu, Jilan; Chen, Guo; Lou, Ping; Wang, Limin; Qiao, Yu

doi:10.1109/cvpr52733.2024.02095

Article · Jun 16, 2024 · Closed access

Shenzhen Institutes of Advanced Technology · Chinese Academy of Sciences · +5 more institutions

Indexed in Crossref

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic…

Citation impact

Total citations: 161

FWCI: 36.54
Percentile: 100%
References: 134

Authors: 12

Topics & keywords

Keywords

Benchmark (surveying)
Computer science
Modal
Geology
