VLP: A Survey on Vision-language Pre-training

Chen, Feilong; Zhang, Duzhen; Han, Minglun; Chen, Xiu-Yi; Shi, Jing; Xu, Shuang; Xu, Bo

doi:10.1007/s11633-022-1369-5

Article · Machine Intelligence Research · Jan 10, 2023 · Hybrid OA

Chinese Academy of Sciences · Shandong Institute of Automation · +1 more institution

Indexed in: arXiv, Crossref

Abstract

In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial works have shown that they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model…

Citation impact

Total citations: 232

FWCI: 25.50
Percentile: 100%
References: 156

Authors: 7

Topics & keywords

Keywords

Computer science
GRASP
Modal
Artificial intelligence
Field (mathematics)
Scratch
Architecture
Feature (linguistics)

UN Sustainable Development Goals

Quality Education

Funding

Chinese Academy of Sciences
Award: ZDBS-SSW-JSC006