Article · Computational Visual Media · Feb 1, 2025 · Diamond OA

Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Institute for Advanced Study · Tsinghua University · +3 more institutions

Indexed in Crossref · DOAJ

Abstract

The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is…
