Video Understanding With Large Language Models: A Survey
University of Rochester · Southern University of Science and Technology · +1 more institution
Abstract
With the rapid growth of online video platforms and the escalating volume of video content, the need for proficient video understanding tools has increased significantly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (abstract, temporal, and spatiotemporal) reasoning combined with common-sense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and…
Citation impact
- FWCI: 58.64
- Percentile: 100%
- References: 202
Authors
- 20
Topics & keywords
- Computer science
- Natural language processing
- Artificial intelligence
- Quality Education