MetaFormer is Actually What You Need for Vision

Yu, Weihao; Luo, Mi; Zhou, Pan; Si, Chenyang; Zhou, Yichen; Wang, Xinchao; Feng, Jiashi; Yan, Shuicheng

doi:10.1109/cvpr52688.2022.01055

Article
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Jun 1, 2022
Closed access

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou

National University of Singapore

Indexed in Crossref

Abstract

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model,…
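To make the idea of a pooling token mixer concrete, the following is a minimal sketch and not the authors' implementation. It assumes tokens sit on an H x W grid, uses a 3 x 3 average-pooling window that ignores out-of-bounds positions, and wraps the mixer in a plain residual connection; normalization and the channel MLP of a full block are omitted. The names `avg_pool_mixer` and `residual_block` are hypothetical.

```python
def avg_pool_mixer(grid, k=3):
    """Mix tokens by averaging each k x k spatial neighbourhood.

    grid[i][j] is one token, given as a list of channel values.
    Positions outside the grid are skipped (assumption: padding is
    not counted in the average).
    """
    h, w = len(grid), len(grid[0])
    c = len(grid[0][0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = [0.0] * c
            n = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        n += 1
                        for ch in range(c):
                            acc[ch] += grid[y][x][ch]
            row.append([a / n for a in acc])
        out.append(row)
    return out


def residual_block(grid, mixer):
    """Residual wrapper: output = input + mixer(input).

    Any token mixer (attention, spatial MLP, pooling) could be passed
    in; only the pooling variant is sketched here.
    """
    mixed = mixer(grid)
    return [
        [[a + b for a, b in zip(tok, mix)] for tok, mix in zip(row, mrow)]
        for row, mrow in zip(grid, mixed)
    ]


if __name__ == "__main__":
    tokens = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 x 2 grid, 1 channel
    print(residual_block(tokens, avg_pool_mixer))
```

The mixer has no learnable parameters, which is the sense in which the abstract calls the operator simple: all learned capacity would live in the parts of the block that this sketch leaves out.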

Citation impact

Total citations: 1,127

FWCI: 208.82
Percentile: 100%
References: 94

Authors: 8

Topics & keywords

Topics

Visual perception and processing mechanisms (97%)

Keywords

Computer science
Computer vision
Artificial intelligence

UN Sustainable Development Goals

Industry, innovation and infrastructure
