Article | Jun 16, 2024 | Closed access
VILA: On Pre-training for Visual Language Models
Indexed in Crossref
Abstract
Visual language models (VLMs) have progressed rapidly with the recent success of large language models (LLMs). There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities, is still lacking. In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial…
Citation impact
165 total citations
- FWCI: 37.02
- Percentile: 100%
- References: 96
Authors: 6
Topics & keywords
Topics
Keywords
- Training (meteorology)
- Computer science
- Natural language processing
- Artificial intelligence
- Geography
UN Sustainable Development Goals
- Quality Education