Foundation Models Defining a New Era in Vision: A Survey and Outlook

Awais, Muhammad; Naseer, Muzammal; Khan, Salman; Anwer, Rao Muhammad; Cholakkal, Hisham; Shah, Mubarak; Yang, Ming–Hsuan; Khan, Fahad Shahbaz

doi:10.1109/tpami.2024.3506283

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Jan 9, 2025 · Closed access

Mohamed bin Zayed University of Artificial Intelligence · Australian National University · +6 more institutions

Indexed in: Crossref, PubMed

Abstract

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a…

Citation impact

Total citations: 183

FWCI: 338.98
Percentile: 100%
References: 199

Authors (8)

Muhammad Awais (corresponding)
Mohamed bin Zayed University of Artificial Intelligence

Muzammal Naseer
Australian National University, Georgia Institute of Technology, Khalifa University of Science and Technology

Salman Khan
Australian National University, Mohamed bin Zayed University of Artificial Intelligence

Rao Muhammad Anwer
Mohamed bin Zayed University of Artificial Intelligence

Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence

Topics & keywords

Keywords

Computer science
Artificial intelligence
Interpretability
Modalities
Human–computer interaction
Foundation (evidence)
Field (mathematics)
Vision science
