PointVLA: Injecting the 3D World Into Vision-Language-Action Models
Shanghai University · Midea Group (China)
Abstract
Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less…
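The injection idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `frozen_block`, `Injector`, and `forward`, the residual tanh block, the zero-initialized projection, and the choice of blocks 2 and 3 as injection points. It only shows the general pattern of adding projected point-cloud features to the hidden state of a frozen network, so that the network's original behavior is preserved until the injector is trained.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def frozen_block(W, h):
    """Toy stand-in for one frozen block of an action expert (hypothetical)."""
    return [hi + math.tanh(v) for hi, v in zip(h, matvec(W, h))]

class Injector:
    """Hypothetical lightweight block: projects a point-cloud feature
    and adds it to the hidden state. Zero initialization is an assumption
    of this sketch; it makes the injector a no-op before training."""
    def __init__(self, pc_dim, hidden_dim):
        self.W = [[0.0] * pc_dim for _ in range(hidden_dim)]

    def __call__(self, h, pc_feat):
        return [hi + d for hi, d in zip(h, matvec(self.W, pc_feat))]

def forward(blocks, h, pc_feat=None, injectors=None):
    """Run the frozen blocks, injecting 3D features only at chosen indices."""
    injectors = injectors or {}
    for i, W in enumerate(blocks):
        if pc_feat is not None and i in injectors:
            h = injectors[i](h, pc_feat)
        h = frozen_block(W, h)
    return h

rng = random.Random(0)
hidden_dim, pc_dim, n_blocks = 4, 3, 4
# Frozen weights: never modified after this point.
blocks = [[[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
           for _ in range(hidden_dim)] for _ in range(n_blocks)]
h0 = [0.2, -0.4, 0.1, 0.3]
pc_feat = [0.5, -1.0, 2.0]  # placeholder for an encoded point cloud

# Injection points are arbitrary here; the abstract describes choosing
# them through a skip-block analysis, which this sketch does not perform.
injectors = {2: Injector(pc_dim, hidden_dim), 3: Injector(pc_dim, hidden_dim)}

baseline = forward(blocks, h0)
out_zero = forward(blocks, h0, pc_feat, injectors)
injectors[2].W[0][0] = 1.0  # simulate a single trained injector weight
out_injected = forward(blocks, h0, pc_feat, injectors)
```

With zero-initialized injectors, `out_zero` matches `baseline` exactly, which is the property that lets a pre-trained model keep its 2D behavior; once an injector weight is non-zero, `out_injected` diverges while the frozen block weights stay untouched.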
Citation impact
- FWCI: 97.23
- Percentile: 99%
- References: 20
Authors (5)
- Chengmeng Li (corresponding author), Shanghai University
- Junjie Wen, Midea Group (China)
- Yaxin Peng, Shanghai University
- Yan Peng, Shanghai University
- Yichen Zhu, Midea Group (China)
Topics & keywords
- Point cloud
- Key (lock)
- Table (database)
- Modular design
- Robot
- RGB color model
- Action (physics)
- Point (geometry)
- Reduced inequalities