A Survey on Vision–Language–Action Models for Embodied AI

Ma, Yueen; Song, Zixing; Zhuang, Yuzheng; Hao, Jianye; King, Irwin

doi:10.1109/tnnls.2025.3650584

Preprint · IEEE Transactions on Neural Networks and Learning Systems · Jan 1, 2026 · Green OA

Chinese University of Hong Kong · University of Bristol · +1 more institution

Indexed in: arXiv, Crossref, DataCite, PubMed

Abstract

Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models, referred to as vision-language-action (VLA) models, has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs,…

Citation impact

Total citations: 11
FWCI: 0.00
Percentile: 99%
References: 0

Authors: 5

Topics & keywords

Topics

Multimodal Machine Learning Applications (92%)

Keywords

Embodied cognition
Action (physics)
Computer science
Cognitive science
Artificial intelligence
Natural language processing
Psychology
Physics
