Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Lin, Bin; Ye, Yang; Zhu, Bin; Cui, Jiaxi; Ning, Munan; Jin, Peng; Li, Yuan

doi:10.18653/v1/2024.emnlp-main.342

Article, Jan 1, 2024, Gold OA


Indexed in Crossref

Abstract

Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from…

Citation impact

Total citations: 233

FWCI: 52.08
Percentile: 100%
References: 0

Authors: 7

Topics & keywords

Keywords

Computer science
Projection (relational algebra)
Artificial intelligence
Computer vision
Representation (politics)
Computer graphics (images)
Algorithm

UN Sustainable Development Goals

Reduced inequalities

Funding

National Natural Science Foundation of China
Awards: 62425101, 62332002, 62202014