CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention

Wang, Wenxiao; Chen, Wei; Qiu, Qibo; Chen, Long; Wu, Boxi; Lin, Binbin; He, Xiaofei; Liu, Wei

doi:10.1109/tpami.2023.3341806

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Dec 19, 2023 · Closed access

Zhejiang University · Zhejiang Lab · +2 more institutions

Indexed in: Crossref, PubMed

Abstract

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens.…
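
The split that LSDA makes can be illustrated with a small sketch. The abstract (truncated here) only says that self-attention is divided into a short-distance part and a long-distance part; the concrete grouping rules below are assumptions made for illustration: adjacent blocks of tokens for the short-distance part, and tokens sampled at a fixed interval for the long-distance part. The function names and the parameters `group` and `interval` are hypothetical and are not taken from the authors' code.

```python
def short_distance_groups(size, group):
    """Partition a size x size token grid into adjacent group x group blocks.

    Assumption: short-distance attention runs only inside each block.
    """
    if size % group:
        raise ValueError("size must be divisible by group")
    groups = {}
    for row in range(size):
        for col in range(size):
            key = (row // group, col // group)
            groups.setdefault(key, []).append((row, col))
    return list(groups.values())


def long_distance_groups(size, interval):
    """Partition the grid by sampling tokens at a fixed interval.

    Assumption: long-distance attention runs among tokens that share the
    same offset modulo `interval`, so each group spans the whole grid.
    """
    if size % interval:
        raise ValueError("size must be divisible by interval")
    groups = {}
    for row in range(size):
        for col in range(size):
            key = (row % interval, col % interval)
            groups.setdefault(key, []).append((row, col))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(g) ** 2 for g in groups)


size = 4
sda = short_distance_groups(size, group=2)
lda = long_distance_groups(size, interval=2)
full = attention_pairs([[(r, c) for r in range(size) for c in range(size)]])

print(sda[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(lda[0])  # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(full, attention_pairs(sda), attention_pairs(lda))  # 256 64 64
```

On this toy 4 x 4 grid, each restricted variant evaluates 64 query-key pairs instead of 256, which is one way to read the abstract's claim of a reduced computational burden. Under these assumed rules, a short-distance group holds neighbouring tokens and a long-distance group holds tokens spread across the grid.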

Citation impact

Total citations: 279

FWCI: 31.22
Percentile: 100%
References: 66

Authors: 8

Topics & keywords

Keywords

Transformer
Computer science
Artificial intelligence
Embedding
Segmentation
Computer vision
Image segmentation
Security token

Funding

National Natural Science Foundation of China
Awards: 62273303, 62303406