A Survey on Vision–Language–Action Models for Embodied AI

Chinese University of Hong Kong · University of Bristol · +1 more institution

Indexed in: arXiv, Crossref, DataCite, PubMed

Abstract

Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models, referred to as vision-language-action (VLA) models, has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs,…

Citation impact

Total citations: 11
FWCI: 0.00
Percentile: 99%
References: 0

Authors

5

Topics & keywords

Keywords
  • Embodied cognition
  • Action (physics)
  • Computer science
  • Cognitive science
  • Artificial intelligence
  • Natural language processing
  • Psychology
  • Physics