SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Chen, Boyuan; Xu, Zhuo; Kirmani, Sean; Ichter, Brian; Sadigh, Dorsa; Guibas, Leonidas; Xia, Fei

doi:10.1109/cvpr52733.2024.01370

Article · Jun 16, 2024 · Closed access

Google (United States) · DeepMind (United Kingdom)

Indexed in Crossref

Abstract

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size difference. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data…

Citation impact

Total citations: 160

FWCI: 35.80
Percentile: 100%
References: 85

Authors: 7

Topics & keywords

Keywords

Computer science
Spatial intelligence
Artificial intelligence
Natural language processing
Human–computer interaction
