InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Chen, Zhe; Wu, Jiannan; Wang, Wenhai; Su, Weijie; Chen, Guo; Xing, Sen; Zhong, Muyan; Zhang, Qinglong; Zhu, Xizhou; Lu, Lewei; Li, Bin; Luo, Ping; Lü, Tong; Qiao, Yu; Dai, Jifeng

doi:10.1109/cvpr52733.2024.02283

Article · Jun 16, 2024 · Closed access

Zhe Chen · Jiannan Wu · Wenhai Wang · Weijie Su · Guo Chen

Nanjing University · University of Hong Kong · +4 more institutions

Indexed in Crossref

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level…

Citation impact

Total citations: 394

FWCI: 88.28
Percentile: 100%
References: 205

Authors

15

Topics & keywords

Keywords

Computer science
Foundation (evidence)
Scaling
Natural language processing
Artificial intelligence
Linguistics
History
Mathematics

UN Sustainable Development Goals

Quality Education

Funding

European Research Consortium for Informatics and Mathematics
Award: 62376134, 62372223