UniFormer: Unifying Convolution and Self-Attention for Visual Recognition

Chinese Academy of Sciences · Shenzhen Institutes of Advanced Technology · +5 more institutions

Indexed in: Crossref, PubMed

Abstract

Learning discriminative representations from images and videos is challenging because such visual data exhibit large local redundancy and complex global dependency. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Although CNNs can efficiently reduce local redundancy by convolving within a small neighborhood, their limited receptive field makes it hard to capture global dependency. ViTs, in contrast, can effectively capture long-range dependency via self-attention, but their blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can…

Citation impact

  • Total citations: 533
  • FWCI: 58.76
  • Percentile: 100%
  • References: 167

Authors

8 authors

Topics & keywords

Keywords
  • Computer science
  • Convolution (computer science)
  • Artificial intelligence
  • Pattern recognition (psychology)
  • Computer vision
  • Artificial neural network
UN Sustainable Development Goals
  • Reduced inequalities

Funding