A Survey on Vision–Language–Action Models for Embodied AI
Chinese University of Hong Kong · University of Bristol · +1 more institution
Abstract
Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models, referred to as vision-language-action (VLA) models, has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs,…
Citation impact
- FWCI: 0.00
- Percentile: 99%
- References: 0
Authors
- 5
Topics & keywords
- Embodied cognition
- Action (physics)
- Computer science
- Cognitive science
- Artificial intelligence
- Natural language processing
- Psychology
- Physics