Article · Jan 1, 2024 · Gold OA
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Indexed in Crossref
Abstract
Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from…
Citation impact
233 total citations
- FWCI: 52.08
- Percentile: 100%
- References: 0
Authors
7
Topics & keywords
Topics
Keywords
- Computer science
- Projection (relational algebra)
- Artificial intelligence
- Computer vision
- Representation (politics)
- Computer graphics (images)
- Algorithm
UN Sustainable Development Goals
- Reduced inequalities