VOLO: Vision Outlooker for Visual Recognition

Yuan, Li; Hou, Qibin; Jiang, Zihang; Feng, Jiashi; Yan, Shuicheng

doi:10.1109/tpami.2022.3206108

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Jan 1, 2022 · Closed access


Peking University · Peng Cheng Laboratory · Nankai University · National University of Singapore

Indexed in: Crossref, PubMed

Abstract

Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. With low efficiency in encoding fine-level features, the performance of ViTs is still inferior to the state-of-the-art CNNs when trained from scratch on a midsize dataset like ImageNet. Through experimental analysis, we find it is because of two reasons: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we present a new simple and generic…

Citation impact

Total citations: 269

FWCI: 26.39
Percentile: 100%
References: 126

Authors (5)

Li Yuan (corresponding author): Peking University; Peng Cheng Laboratory
Qibin Hou: Nankai University
Zihang Jiang: National University of Singapore
Jiashi Feng
Shuicheng Yan

Topics & keywords

Keywords

Computer science
Artificial intelligence
Computation
Pattern recognition (psychology)
Bottleneck
Feature (linguistics)
Security token
Transformer

UN Sustainable Development Goals

Industry, innovation and infrastructure

Funding

National Natural Science Foundation of China
Award: 62202014