Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Zhang, Hang; Li, Xin; Bing, Lidong

doi:10.18653/v1/2023.emnlp-demo.49

Article | Jan 1, 2023 | Gold OA

Hang Zhang, Xin Li, Lidong Bing

Alibaba Group (China)

Indexed in Crossref

Abstract

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual & audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn…

Citation impact

Total citations: 440

FWCI: 49.97
Percentile: 100%
References: 33

Authors: 3

Topics & keywords

Keywords

Computer science
Encoder
Audio visual
Embedding
Multimedia
Leverage (statistics)
Video processing
Visualization

UN Sustainable Development Goals

Quality Education
