Article | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | Jun 1, 2022 | Closed access
MetaFormer is Actually What You Need for Vision
National University of Singapore
Indexed in Crossref
Abstract
Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of transformers, rather than the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator that performs only basic token mixing. Surprisingly, we observe that the derived model,…
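The substitution described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes average pooling over a 3x3 spatial window as the pooling operator (the abstract does not specify the window or the pooling type), and it omits the normalization and channel MLP that a full transformer-style block would contain. The names `pool_token_mixer` and `mixer_block` are hypothetical.

```python
def pool_token_mixer(tokens, pool_size=3):
    """Mix tokens by averaging each one with its spatial neighbors.

    tokens: H x W grid (list of rows) of feature vectors (lists of floats).
    Assumption: average pooling; border positions average only over
    the neighbors that actually exist.
    """
    height, width = len(tokens), len(tokens[0])
    channels = len(tokens[0][0])
    radius = pool_size // 2
    mixed = []
    for i in range(height):
        row = []
        for j in range(width):
            # Collect the in-bounds neighbors of position (i, j).
            neighbors = [
                tokens[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            # Channel-wise mean over the neighborhood.
            row.append([
                sum(vec[c] for vec in neighbors) / len(neighbors)
                for c in range(channels)
            ])
        mixed.append(row)
    return mixed


def mixer_block(tokens, token_mixer=pool_token_mixer):
    """Residual token-mixing step: output = input + token_mixer(input).

    The token mixer is a parameter, so pooling could be swapped for
    another mixing function without touching the surrounding structure.
    """
    mixed = token_mixer(tokens)
    return [
        [[t + m for t, m in zip(tok, mix)] for tok, mix in zip(trow, mrow)]
        for trow, mrow in zip(tokens, mixed)
    ]


if __name__ == "__main__":
    grid = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2 x 2 grid, 1 channel
    print(mixer_block(grid))
```

Passing the mixer as an argument mirrors the hypothesis in the abstract: the surrounding structure stays fixed while the token mixer is interchangeable.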
Citation impact
- Total citations: 1,127
- FWCI: 208.82
- Percentile: 100%
- References: 94
Authors: 8
Topics & keywords
Keywords
- Computer science
- Computer vision
- Artificial intelligence
UN Sustainable Development Goals
- Industry, innovation and infrastructure