MetaFormer is Actually What You Need for Vision

National University of Singapore

Indexed in Crossref

Abstract

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model,…
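The idea of swapping the token mixer while keeping the surrounding block fixed can be illustrated with a minimal sketch. This is not the authors' implementation: the names `pool_mixer` and `mixing_block`, the 3x3 window, the single-channel token grid, and the border handling are all assumptions made for illustration, and the normalization and channel MLP that a full transformer-style block would contain are left out.

```python
from typing import Callable, List

# Illustrative assumption: an H x W grid of single-channel tokens.
Grid = List[List[float]]


def pool_mixer(tokens: Grid, window: int = 3) -> Grid:
    """Average each token with its spatial neighbours (no learned weights)."""
    h, w = len(tokens), len(tokens[0])
    r = window // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Use in-bounds neighbours only, so border tokens average fewer values.
            vals = [
                tokens[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            out[i][j] = sum(vals) / len(vals)
    return out


def mixing_block(tokens: Grid, mixer: Callable[[Grid], Grid]) -> Grid:
    """Residual token-mixing step: output = input + mixer(input).

    The mixer is a plain function argument, so attention, a spatial MLP,
    or pooling could be plugged in without changing the block itself.
    """
    mixed = mixer(tokens)
    return [[t + m for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(tokens, mixed)]


grid = [[1.0, 2.0], [3.0, 4.0]]
result = mixing_block(grid, pool_mixer)
```

Passing the mixer as an argument mirrors the hypothesis in the abstract: the block structure stays the same and only the token-mixing function changes.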

Citation impact

Total citations: 1,127
FWCI: 208.82
Percentile: 100%
References: 94

Authors

8 authors

Topics & keywords

Keywords
  • Computer science
  • Computer vision
  • Artificial intelligence
UN Sustainable Development Goals
  • Industry, innovation and infrastructure