Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yang, Yuqi; Guo, Yuxiao; Xiong, Jian-Yu; Liu, Yang; Pan, Hao; Wang, Peng‐Shuai; Tong, Xin; Guo, Baining

doi:10.26599/cvm.2025.9450383

Article · Computational Visual Media · Feb 1, 2025 · Diamond OA

Institute for Advanced Study · Tsinghua University · +3 more institutions

Indexed in: Crossref, DOAJ

Abstract

The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is…

Citation impact

Total citations: 70

FWCI: 130.21
Percentile: 100%
References: 55

Authors: 8

Topics & keywords

Keywords

Transformer
Computer science
Computer graphics (images)
Computer graphics
Artificial intelligence
Computer vision
Engineering
Electrical engineering
