Abstract

The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather…

Citation impact

Total citations: 6,915
FWCI: 365.25
Percentile: 100%
References: 121

Authors: 6

Topics & keywords

Keywords
  • Transformer
  • Computer science
  • Artificial intelligence
  • Segmentation
  • Scalability
  • Object detection
  • Machine learning
  • Image segmentation