Article · IEEE Robotics and Automation Letters · Jan 12, 2026 · Closed access

PointVLA: Injecting the 3D World Into Vision-Language-Action Models

Chengmeng Li · Junjie Wen · Yaxin Peng · Yan Peng · Yichen Zhu

Shanghai University · Midea Group (China)

Indexed in Crossref

Abstract

Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less…

Citation impact

Total citations: 5
FWCI: 97.23
Percentile: 99%
References: 20

Authors: 5

Topics & keywords

Keywords
  • Point cloud
  • Key (lock)
  • Table (database)
  • Modular design
  • Robot
  • RGB color model
  • Action (physics)
  • Point (geometry)
UN Sustainable Development Goals
  • Reduced inequalities

Funding