Swin Transformer V2: Scaling Up Capacity and Resolution
University of Science and Technology of China · Microsoft Research Asia (China) · +3 more institutions
Abstract
We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. We tackle issues of training instability, and study how to effectively transfer models pre-trained at low resolutions to higher resolution ones. To this aim, several novel technologies are proposed: 1) a residual post…
Citation impact
- FWCI: 118.01
- Percentile: 100%
- References: 83
Authors
- 12
Topics & keywords
- Computer science
- Artificial intelligence
- Transformer
- Normalization (sociology)
- Scaling
- Segmentation
- Computer vision
- Pattern recognition (psychology)