article · Jun 16, 2024 · Closed access

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Shenzhen Institutes of Advanced Technology · Chinese Academy of Sciences · +5 more institutions

Indexed in Crossref

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic…

Funding