Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
Institute for Advanced Study · Tsinghua University · +3 more institutions
Abstract
The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on the synthetic Structured3D dataset, which is…
Citation impact
- FWCI: 130.21
- Percentile: 100%
- References: 55
Authors (8)
- Yuqi Yang (corresponding author)
Institute for Advanced Study, Tsinghua University
- Yuxiao Guo
Microsoft Research Asia (China), Tsinghua–Berkeley Shenzhen Institute, Tsinghua University
- Jian-Yu Xiong
Tsinghua–Berkeley Shenzhen Institute, Microsoft Research Asia (China), Tsinghua University
- Yang Liu
Tsinghua University, Microsoft Research Asia (China), Tsinghua–Berkeley Shenzhen Institute
- Hao Pan
Tsinghua University, Tsinghua–Berkeley Shenzhen Institute, Microsoft Research Asia (China)
Topics & keywords
- Transformer
- Computer science
- Computer graphics (images)
- Computer graphics
- Artificial intelligence
- Computer vision
- Engineering
- Electrical engineering