Video Understanding With Large Language Models: A Survey

Tang, Yunlong; Bi, Jing; Xu, Siting; Song, Luchuan; Liang, Susan; Wang, Teng; Zhang, Daoan; An, Jie; Lin, Jingyang; Zhu, Rongyi; Vosoughi, Ali; Huang, Chao; Zhang, Zeliang; Liu, Pinxin; Feng, Mingqian; Zheng, Feng; Zhang, Jianguo; Luo, Ping; Luo, Jiebo; Xu, Chenliang

doi:10.1109/tcsvt.2025.3566695

Article · IEEE Transactions on Circuits and Systems for Video Technology · May 2, 2025 · Closed access

University of Rochester · Southern University of Science and Technology · +1 more institution

Indexed in Crossref

Abstract

With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and…

Citation impact

Total citations: 59

FWCI: 58.64
Percentile: 100%
References: 202

Authors

20

Yunlong Tang (Corresponding)
University of Rochester
Jing Bi
University of Rochester
Siting Xu
Southern University of Science and Technology
Luchuan Song
University of Rochester
Susan Liang
University of Rochester

Topics & keywords

Topics

Keywords

Computer science
Natural language processing
Artificial intelligence

UN Sustainable Development Goals

Quality Education