VILA: On Pre-training for Visual Language Models

Ji, Lin; Yin, Hongxu; Wei, Ping; Molchanov, Pavlo; Shoeybi, Mohammad; Han, Song

doi:10.1109/cvpr52733.2024.02520

Article | Jun 16, 2024 | Closed access


Indexed in Crossref

Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but the field lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial…

Citation impact

Total citations: 165

FWCI: 37.02
Percentile: 100%
References: 96

Authors: 6

Topics & keywords

Keywords

Training (meteorology)
Computer science
Natural language processing
Artificial intelligence
Geography

UN Sustainable Development Goals

Quality Education
