InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Nanjing University · University of Hong Kong · +4 more institutions
Abstract
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level…
Citation impact
- FWCI
- 88.28
- Percentile
- 100%
- References
- 205
Authors
- 15
Topics & keywords
- Computer science
- Foundation (evidence)
- Scaling
- Natural language processing
- Artificial intelligence
- Linguistics
- History
- Mathematics
- Quality Education