Foundation Models Defining a New Era in Vision: A Survey and Outlook

Mohamed bin Zayed University of Artificial Intelligence · Australian National University · +6 more institutions

Indexed in: Crossref, PubMed

Abstract

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, combined with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundation models. Their output can be modified through human-provided prompts without retraining, e.g., segmenting a…

Citation impact

Total citations: 183
FWCI: 338.98
Percentile: 100%
References: 199

Authors

8

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Interpretability
  • Modalities
  • Human–computer interaction
  • Foundation (evidence)
  • Field (mathematics)
  • Vision science