PointVLA: Injecting the 3D World Into Vision-Language-Action Models

Li, Chengmeng; Wen, Junjie; Peng, Yaxin; Peng, Yan; Zhu, Yichen

doi:10.1109/lra.2026.3653303

Article · IEEE Robotics and Automation Letters · Jan 12, 2026 · Closed access

Shanghai University · Midea Group (China)

Indexed in Crossref

Abstract

Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less…

Citation impact

Total citations: 5
FWCI: 97.23
Percentile: 99%
References: 20

Authors (5)

Chengmeng Li (corresponding), Shanghai University
Junjie Wen, Midea Group (China)
Yaxin Peng, Shanghai University
Yan Peng, Shanghai University
Yichen Zhu, Midea Group (China)

Topics & keywords

Keywords

Point cloud
Key (lock)
Table (database)
Modular design
Robot
RGB color model
Action (physics)
Point (geometry)

UN Sustainable Development Goals

Reduced inequalities

Funding

National Natural Science Foundation of China
Awards: 62225308, 12471501