Article · Jun 16, 2024 · Closed access

VILA: On Pre-training for Visual Language Models

Indexed in Crossref

Abstract

Visual language models (VLMs) have progressed rapidly with the recent success of large language models (LLMs). There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities, is still lacking. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing the LLM during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial…

Citation impact

Total citations: 165
FWCI: 37.02
Percentile: 100%
References: 96

Authors

6

Topics & keywords

Keywords
  • Training (meteorology)
  • Computer science
  • Natural language processing
  • Artificial intelligence
  • Geography
UN Sustainable Development Goals
  • Quality Education