Shunted Self-Attention via Multi-Scale Token Aggregation

Ren, Sucheng; Zhou, Daquan; He, Shengfeng; Feng, Jiashi; Wang, Xinchao

doi:10.1109/cvpr52688.2022.01058

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

South China University of Technology · National University of Singapore

Indexed in Crossref

Abstract

Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually assign a similar receptive field to each token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation in handling images with multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model the attentions at hybrid scales per…

Citation impact

Total citations: 328

FWCI: 18.30
Percentile: 100%
References: 52

Authors: 5

Topics & keywords

Keywords

Computer science
Security token
Computation
Transformer
Artificial intelligence
Computer engineering
Algorithm
Computer network
