VOLO: Vision Outlooker for Visual Recognition
Peking University · Peng Cheng Laboratory · +2 more institutions
Abstract
Vision Transformers (ViTs) have recently been widely explored for visual recognition. Because they encode fine-level features inefficiently, ViTs still underperform state-of-the-art CNNs when trained from scratch on a midsize dataset such as ImageNet. Our experimental analysis identifies two causes: 1) the simple tokenization of input images fails to model important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we present a new simple and generic…
Citation impact
- FWCI: 26.39
- Percentile: 100%
- References: 126
Authors (5)
- Li Yuan (corresponding author)
Peking University, Peng Cheng Laboratory
- Qibin Hou
Nankai University
- Zihang Jiang
National University of Singapore
- Jiashi Feng
- Shuicheng Yan
Topics & keywords
- Computer science
- Artificial intelligence
- Computation
- Pattern recognition (psychology)
- Bottleneck
- Feature (linguistics)
- Security token
- Transformer
- Industry, innovation and infrastructure