Article | Jan 1, 2023 | Gold OA
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Indexed in Crossref
Abstract
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual & audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn…
Citation impact
- Total citations: 440
- FWCI: 49.97
- Percentile: 100%
- References: 33
Authors: 3
Topics & keywords
Topics
Keywords
- Computer science
- Encoder
- Audio visual
- Embedding
- Multimedia
- Leverage (statistics)
- Video processing
- Visualization
UN Sustainable Development Goals
- Quality Education