Swin Transformer V2: Scaling Up Capacity and Resolution

Liu, Ze; Hu, Han; Lin, Yutong; Yao, Zhuliang; Xie, Zhenda; Wei, Yixuan; Jia, Ning; Cao, Yue; Zhang, Zheng; Dong, Li; Wei, Furu; Guo, Baining

doi:10.1109/cvpr52688.2022.01170

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

University of Science and Technology of China · Microsoft Research Asia (China) · +3 more institutions

Indexed in Crossref

Abstract

We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. We tackle issues of training instability, and study how to effectively transfer models pre-trained at low resolutions to higher resolution ones. To this aim, several novel technologies are proposed: 1) a residual post…

Citation impact

Total citations: 2,172

FWCI: 118.01
Percentile: 100%
References: 83

Authors: 12

Topics & keywords

Keywords

Computer science
Artificial intelligence
Transformer
Normalization (sociology)
Scaling
Segmentation
Computer vision
Pattern recognition (psychology)

Funding

Microsoft