article · Jun 16, 2024 · Closed access
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Indexed in Crossref
Abstract
Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid…
Citation impact
109 total citations
- FWCI: 25.35
- Percentile: 100%
- References: 68
Authors: 5
Topics & keywords
Keywords
- Computer science
- Transformer
- Artificial intelligence
- Feature (linguistics)
- Computer vision
- Pattern recognition (psychology)
- Feature extraction
- Convolutional neural network