VOLO: Vision Outlooker for Visual Recognition

Peking University · Peng Cheng Laboratory · +2 more institutions

Indexed in: Crossref, PubMed

Abstract

Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. With low efficiency in encoding fine-level features, the performance of ViTs is still inferior to the state-of-the-art CNNs when trained from scratch on a midsize dataset like ImageNet. Through experimental analysis, we find it is because of two reasons: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness for fixed computation budgets and limited training samples. To overcome such limitations, we present a new simple and generic…

Citation impact

Total citations: 269
FWCI: 26.39
Percentile: 100%
References: 126

Authors

5 authors

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Computation
  • Pattern recognition (psychology)
  • Bottleneck
  • Feature (linguistics)
  • Security token
  • Transformer
UN Sustainable Development Goals
  • Industry, innovation and infrastructure

Funding