TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Midea Group (China) · East China Normal University · +3 more institutions
Abstract
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this letter, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy…
Citation impact
- FWCI: 46.35
- Percentile: 100%
- References: 44
Authors (13)
- Junjie Wen (corresponding author)
Midea Group (China), East China Normal University
- Yichen Zhu
Midea Group (China)
- Jinming Li
Midea Group (China)
- Minjie Zhu
Midea Group (China), East China Normal University
- Zhibin Tang
Midea Group (China)
Topics & keywords
- Action (physics)
- Computer science
- Artificial intelligence
- Human–computer interaction
- Computer vision