Vision-Language Models for Vision Tasks: A Survey

Nanyang Technological University

PubMed
Indexed in: Crossref, PubMed

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that…

Citation impact

Total citations: 703
FWCI: 155.95
Percentile: 100%
References: 236

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Machine learning
  • Categorization
  • Task (project management)
  • Benchmarking
  • Task analysis