Article, Jun 16, 2024, Closed access

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Baidu (China)

Indexed in Crossref

Abstract

Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid…

Citation impact

Total citations: 109
FWCI: 25.35
Percentile: 100%
References: 68

Authors

5 authors

Topics & keywords

Keywords
  • Computer science
  • Transformer
  • Artificial intelligence
  • Feature (linguistics)
  • Computer vision
  • Pattern recognition (psychology)
  • Feature extraction
  • Convolutional neural network