CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention
Zhejiang University · Zhejiang Lab · +2 more institutions
Abstract
While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens.…
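The split that LSDA makes can be illustrated with a small grouping sketch. The code below is a minimal NumPy illustration, not the authors' implementation: the function names `sda_groups`, `lda_groups`, and `group_attention`, the group size `g`, and the sampling `interval` are all hypothetical. It also assumes that "short-distance" means attention within windows of adjacent tokens and "long-distance" means attention among tokens sampled at a fixed interval, and it omits learned query/key/value projections entirely.

```python
import numpy as np

def sda_groups(x, g):
    """Short-distance grouping (assumed): each group is a g x g window of adjacent tokens."""
    h, w, c = x.shape
    assert h % g == 0 and w % g == 0
    x = x.reshape(h // g, g, w // g, g, c)
    # Bring the window indices together, then flatten each window into one group.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, g * g, c)

def lda_groups(x, interval):
    """Long-distance grouping (assumed): tokens sampled at a fixed interval share a group."""
    h, w, c = x.shape
    assert h % interval == 0 and w % interval == 0
    x = x.reshape(h // interval, interval, w // interval, interval, c)
    # Group by the offset within each interval, so members are `interval` apart.
    return x.transpose(1, 3, 0, 2, 4).reshape(interval * interval, -1, c)

def group_attention(groups):
    """Plain softmax self-attention inside each group (no learned projections)."""
    d = groups.shape[-1]
    scores = groups @ groups.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ groups

# Token ids on a 4 x 4 grid, one channel, to make the grouping visible.
ids = np.arange(16, dtype=float).reshape(4, 4, 1)
short = sda_groups(ids, 2)   # first group holds tokens 0, 1, 4, 5
long_ = lda_groups(ids, 2)   # first group holds tokens 0, 2, 8, 10
out = group_attention(short)
```

In this sketch each token attends only to the members of its own group rather than to every token on the grid, which is where the saving in computation comes from; alternating the two groupings lets information travel both locally and across the whole feature map.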
Citation impact
- FWCI: 31.22
- Percentile: 100%
- References: 66
Authors
- 8
Topics & keywords
- Transformer
- Computer science
- Artificial intelligence
- Embedding
- Segmentation
- Computer vision
- Image segmentation
- Security token