UniFormer: Unifying Convolution and Self-Attention for Visual Recognition
Chinese Academy of Sciences · Shenzhen Institutes of Advanced Technology · +5 more institutions
Abstract
Learning discriminative representations from images and videos is challenging because of the large local redundancy and complex global dependency in such visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Although CNNs can efficiently reduce local redundancy by convolution within a small neighborhood, their limited receptive field makes it hard to capture global dependency. ViTs, by contrast, can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can…
Citation impact
- FWCI: 58.76
- Percentile: 100%
- References: 167
Authors (8)
- Kunchang Li (corresponding author)
Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology
- Yali Wang
Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology
- Junhao Zhang
National University of Singapore
- Peng Gao
Beijing Academy of Artificial Intelligence, Shanghai Artificial Intelligence Laboratory
- Guanglu Song
Group Sense (China)
Topics & keywords
- Computer science
- Convolution (computer science)
- Artificial intelligence
- Pattern recognition (psychology)
- Computer vision
- Artificial neural network
- Reduced inequalities