CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention

Zhejiang University · Zhejiang Lab · +2 more institutions

Indexed in: Crossref, PubMed

Abstract

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens.…
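
The LSDA split described above can be pictured with a small grouping sketch. It is illustrative only and rests on assumptions the abstract does not state: tokens lie on a square grid, the short-distance branch attends within blocks of adjacent tokens, and the long-distance branch attends among tokens sampled at a fixed interval. The function names and the group-size parameter `g` are hypothetical. The sketch also counts query-key pairs to show why restricting attention to groups reduces computation.

```python
# Illustrative sketch only. The grid layout, the group size `g`, and the
# fixed-interval sampling rule are assumptions, not details from the abstract.

def short_distance_groups(size, g):
    """Partition a size x size token grid into blocks of adjacent g x g tokens."""
    assert size % g == 0, "grid size must be divisible by the group size"
    groups = {}
    for r in range(size):
        for c in range(size):
            # Neighbouring tokens share the same block index.
            groups.setdefault((r // g, c // g), []).append((r, c))
    return list(groups.values())


def long_distance_groups(size, g):
    """Partition the grid into groups of g x g tokens spaced size // g apart."""
    assert size % g == 0, "grid size must be divisible by the group size"
    interval = size // g
    groups = {}
    for r in range(size):
        for c in range(size):
            # Tokens with the same offset modulo the interval share a group.
            groups.setdefault((r % interval, c % interval), []).append((r, c))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    size, g = 4, 2
    full = [[(r, c) for r in range(size) for c in range(size)]]
    print("full attention pairs:", attention_pairs(full))
    print("short-distance pairs:", attention_pairs(short_distance_groups(size, g)))
    print("long-distance pairs:", attention_pairs(long_distance_groups(size, g)))
```

On a 4 x 4 grid with `g = 2`, each branch forms four groups of four tokens. One short-distance group holds the adjacent tokens (0,0), (0,1), (1,0), (1,1), while one long-distance group holds the spread-out tokens (0,0), (0,2), (2,0), (2,2).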

Citation impact

Total citations: 279
FWCI: 31.22
Percentile: 100%
References: 66

Authors

8

Topics & keywords

Keywords
  • Transformer
  • Computer science
  • Artificial intelligence
  • Embedding
  • Segmentation
  • Computer vision
  • Image segmentation
  • Security token

Funding