Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Hou, Qibin; Lu, Cheng-Ze; Cheng, Ming‐Ming; Feng, Jiashi

doi:10.1109/tpami.2024.3401450

Article, IEEE Transactions on Pattern Analysis and Machine Intelligence, May 15, 2024. Closed access.

Nankai University

Indexed in: Crossref, PubMed

Abstract

Vision Transformers have recently been the most popular network architecture in visual recognition owing to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits their application to downstream tasks. In this paper, we take a close look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (…
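The abstract names a convolutional modulation operation but, being truncated, does not define it. The following is a minimal sketch under one assumed reading: a depthwise convolution over a linearly projected input produces spatial weights, and those weights are multiplied elementwise with a second linear projection of the same input. The function names (`pointwise`, `depthwise`, `conv_modulation`), the nested-list tensor layout, and the absence of activations or an output projection are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical pure-Python sketch; tensors are nested lists shaped C x H x W.

def pointwise(x, w):
    """1x1 convolution: mix channels at every pixel.

    x is C_in x H x W, w is a C_out x C_in weight matrix.
    """
    h, wd = len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(wd)]
             for i in range(h)]
            for o in range(len(w))]


def depthwise(x, kernels):
    """Depthwise k x k convolution (odd k) with zero padding.

    Each channel is filtered by its own kernel; channels are not mixed.
    """
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for chan, ker in zip(x, kernels):
        r = len(ker) // 2
        plane = [[0.0] * wd for _ in range(h)]
        for i in range(h):
            for j in range(wd):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < wd:
                            acc += ker[di + r][dj + r] * chan[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out


def conv_modulation(x, w_a, kernels, w_v):
    """Assumed form: Z = DepthwiseConv(W_a X) * (W_v X), elementwise."""
    a = depthwise(pointwise(x, w_a), kernels)  # spatial weights from a conv
    v = pointwise(x, w_v)                      # value branch
    return [[[a[c][i][j] * v[c][i][j] for j in range(len(v[0][0]))]
             for i in range(len(v[0]))]
            for c in range(len(v))]
```

In this reading the weights come from a fixed-size kernel rather than from pairwise token similarities, so the cost grows linearly with the number of pixels instead of quadratically, which is consistent with the abstract's concern about high-resolution inputs.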

Citation impact

Total citations: 137

FWCI: 30.62
Percentile: 100%
References: 83

Authors: 4

Topics & keywords

Keywords

Computer science
Artificial intelligence
Pattern recognition (psychology)
Transformer
Computer vision
Simple (philosophy)
Machine learning
Engineering
