Video Understanding With Large Language Models: A Survey

University of Rochester · Southern University of Science and Technology · +1 more institution

Indexed in Crossref

Abstract

With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and…

Citation impact

Total citations: 59
FWCI: 58.64
Percentile: 100%
References: 202

Authors

20

Topics & keywords

Keywords
  • Computer science
  • Natural language processing
  • Artificial intelligence
UN Sustainable Development Goals
  • Quality Education