ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Xia, Chunlong; Wang, Xinliang; Lv, Feng; Hao, Xin; Shi, Yifeng

doi:10.1109/cvpr52733.2024.00525

Article · Jun 16, 2024 · Closed access

Baidu (China)

Indexed in Crossref

Abstract

Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid…

Citation impact

Total citations: 109

FWCI: 25.35
Percentile: 100%
References: 68

Authors: 5

Keywords

Computer science
Transformer
Artificial intelligence
Feature (linguistics)
Computer vision
Pattern recognition (psychology)
Feature extraction
Convolutional neural network
