TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

Wen, Junjie; Zhu, Yichen; Li, Jinming; Zhu, Minjie; Tang, Zhibin; Wu, Kun; Xu, Zhiyuan; Liu, Ning; Cheng, Ran; Shen, Chaomin; Peng, Yaxin; Feng, Feifei; Tang, Jian

doi:10.1109/lra.2025.3544909

Article · IEEE Robotics and Automation Letters · Feb 24, 2025 · Closed access

Junjie Wen · Yichen Zhu · Jinming Li · Minjie Zhu · Zhibin Tang

Midea Group (China) · East China Normal University · +3 more institutions

Indexed in Crossref

Abstract

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this letter, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy…

Citation impact

Total citations: 47

FWCI: 46.35
Percentile: 100%
References: 44

Authors (13)

Junjie Wen (corresponding author): Midea Group (China), East China Normal University
Yichen Zhu: Midea Group (China)
Jinming Li: Midea Group (China)
Minjie Zhu: Midea Group (China), East China Normal University
Zhibin Tang: Midea Group (China)

Topics & keywords

Keywords

Action (physics)
Computer science
Artificial intelligence
Human–computer interaction
Computer vision

Funding

National Natural Science Foundation of China
Award: 12471501