Shunted Self-Attention via Multi-Scale Token Aggregation
South China University of Technology · National University of Singapore
Abstract
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually assign similar receptive fields to each token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation in handling images with multiple objects of different scales. To address this issue, we propose a novel and generic strategy, termed shunted self-attention (SSA), that allows ViTs to model the attentions at hybrid scales per…
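As a rough illustration of the multi-scale token aggregation named in the title, the sketch below gives each attention head keys and values merged at a different rate, so that one layer mixes fine and coarse receptive fields. It is a toy under stated assumptions, not the authors' implementation: tokens form a 1-D sequence, aggregation is plain averaging standing in for a learned operator, queries, keys, and values use no learned projections, and the names `aggregate_tokens`, `attend`, `multi_scale_attention`, and `rates` are hypothetical.

```python
import math

def aggregate_tokens(tokens, rate):
    """Merge every `rate` consecutive tokens into one by averaging.
    Assumption: averaging stands in for a learned aggregation."""
    merged = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        merged.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return merged

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

def multi_scale_attention(tokens, rates):
    """One head per rate: queries stay at full resolution, while keys and
    values are the tokens merged at that head's rate."""
    head_outputs = []
    for rate in rates:
        coarse = aggregate_tokens(tokens, rate)
        head_outputs.append(attend(tokens, coarse, coarse))
    # Concatenate the per-head outputs for each token.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(tokens))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
output = multi_scale_attention(tokens, rates=[1, 2])
```

Here the head with rate 1 attends over all four tokens, while the head with rate 2 attends over two merged tokens, which also halves the size of its attention map.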
Citation impact
- FWCI: 18.30
- Percentile: 100%
- References: 52
Authors
- 5
Topics & keywords
- Computer science
- Security token
- Computation
- Transformer
- Artificial intelligence
- Computer engineering
- Algorithm
- Computer network