MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Shenzhen Institutes of Advanced Technology · Chinese Academy of Sciences · +5 more institutions
Abstract
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic…
Citation impact
- FWCI: 36.54
- Percentile: 100%
- References: 134
Authors (12)
- Kunchang Li (corresponding author)
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
- Yali Wang
Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Shanghai Artificial Intelligence Laboratory, ShangHai JiAi Genetics & IVF Institute
- Yinan He
ShangHai JiAi Genetics & IVF Institute, Shanghai Artificial Intelligence Laboratory
- Yizhuo Li
ShangHai JiAi Genetics & IVF Institute, Shanghai Artificial Intelligence Laboratory
- Yi Wang
Shenzhen Institutes of Advanced Technology, Shanghai Artificial Intelligence Laboratory, ShangHai JiAi Genetics & IVF Institute, Chinese Academy of Sciences
Topics & keywords
- Benchmark (surveying)
- Computer science
- Modal
- Geology