Article | Jan 1, 2023 | Gold OA

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Alibaba Group (China)

Indexed in Crossref

Abstract

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual & audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn…

Citation impact

Total citations: 440
FWCI: 49.97
Percentile: 100%
References: 33

Authors

3

Topics & keywords

Keywords
  • Computer science
  • Encoder
  • Audio visual
  • Embedding
  • Multimedia
  • Leverage (statistics)
  • Video processing
  • Visualization
UN Sustainable Development Goals
  • Quality Education