UniFormer: Unifying Convolution and Self-Attention for Visual Recognition

Li, Kunchang; Wang, Yali; Zhang, Junhao; Gao, Peng; Song, Guanglu; Liu, Yu; Li, Hongsheng; Qiao, Yu

doi:10.1109/tpami.2023.3282631

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Jun 5, 2023 · Closed access

Chinese Academy of Sciences · Shenzhen Institutes of Advanced Technology · +5 more institutions

Indexed in: Crossref, PubMed

Abstract

Learning discriminative representations from images and videos is challenging because of the large local redundancy and complex global dependencies in visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Although CNNs can efficiently reduce local redundancy through convolution within a small neighborhood, their limited receptive field makes it hard to capture global dependencies. ViTs, by contrast, can effectively capture long-range dependencies via self-attention, but blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can…

Citation impact

Total citations: 533

FWCI: 58.76
Percentile: 100%
References: 167

Authors: 8

Topics & keywords

Keywords

Computer science
Convolution (computer science)
Artificial intelligence
Pattern recognition (psychology)
Computer vision
Artificial neural network

UN Sustainable Development Goals

Reduced inequalities
