Swin Transformer V2: Scaling Up Capacity and Resolution

University of Science and Technology of China · Microsoft Research Asia (China) · +3 more institutions

Indexed in Crossref

Abstract

We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. We tackle issues of training instability, and study how to effectively transfer models pre-trained at low resolutions to higher resolution ones. To this aim, several novel technologies are proposed: 1) a residual post…

Citation impact

Total citations: 2,172
FWCI: 118.01
Percentile: 100%
References: 83

Authors

12

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Transformer
  • Normalization (sociology)
  • Scaling
  • Segmentation
  • Computer vision
  • Pattern recognition (psychology)

Funding