ConViT: improving vision transformers with soft convolutional inductive biases

Indexed in: arXiv, Crossref

Abstract

Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional…

Citation impact

  • Total citations: 711
  • FWCI: 68.10
  • Percentile: 100%
  • References: 79

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Inductive bias
  • Locality
  • Artificial intelligence
  • Convolutional neural network
  • Transformer
  • Pattern recognition (psychology)
  • Machine learning