Vision-Language Models for Vision Tasks: A Survey

Zhang, J; Huang, Jiaxing; Jin, Sheng; Lu, Shijian

doi:10.1109/tpami.2024.3369699

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Feb 26, 2024 · Closed access

Nanyang Technological University

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs). They also usually train a separate DNN for each visual recognition task, which makes the visual recognition paradigm laborious and time-consuming. Vision-Language Models (VLMs) have recently been studied intensively as a way to address both challenges. VLMs learn rich vision-language correlations from web-scale image-text pairs, which are almost infinitely available on the Internet, and a single VLM can make zero-shot predictions on a wide range of visual recognition tasks. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that…
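The idea summarized in the abstract can be sketched in a few lines. A VLM learns image-text correspondence from paired data. It can then classify an image by comparing it with a text description of each class. The sketch below is a minimal CLIP-style illustration under stated assumptions and is not the survey's own method. The embeddings are hand-made toy vectors that stand in for the outputs of hypothetical image and text encoders. The symmetric InfoNCE loss is one common contrastive pretraining objective for VLMs, and the temperature of 0.07 is an assumed value.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(images: Sequence[Sequence[float]],
                      texts: Sequence[Sequence[float]],
                      temperature: float = 0.07) -> List[Vector]:
    """Cosine similarity of every image with every text, divided by a temperature (assumed 0.07)."""
    img = [l2_normalize(v) for v in images]
    txt = [l2_normalize(t) for t in texts]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one row, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(images: Sequence[Sequence[float]],
                     texts: Sequence[Sequence[float]],
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: the i-th image and the i-th text form the only positive pair."""
    logits = similarity_matrix(images, texts, temperature)
    n = len(logits)
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[i], i) for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


def zero_shot_predict(image: Sequence[float],
                      class_prompt_embeddings: Sequence[Sequence[float]]) -> int:
    """Index of the class prompt whose embedding is most similar to the image embedding."""
    scores = similarity_matrix([image], class_prompt_embeddings)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings for illustration only. Each row stands in for a hypothetical text
# encoder's output for a prompt such as "a photo of a {class}".
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image = [0.1, 0.9, 0.2]  # stands in for a hypothetical image encoder's output
print(zero_shot_predict(image, prompts))  # 1

# Correctly aligned pairs give a lower contrastive loss than shuffled (mismatched) pairs.
shuffled = [prompts[1], prompts[2], prompts[0]]
print(contrastive_loss(prompts, prompts) < contrastive_loss(prompts, shuffled))  # True
```

Zero-shot prediction needs no task-specific training. A new set of classes only requires new text prompts, which is the "single VLM for many tasks" property that the abstract describes.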

Citation impact

703 total citations
FWCI: 155.95
Percentile: 100%
References: 236
Authors: 4

Topics & keywords

Keywords

Computer science
Artificial intelligence
Machine learning
Categorization
Task (project management)
Benchmarking
Task analysis
