ConViT: improving vision transformers with soft convolutional inductive biases*

d’Ascoli, Stéphane; Touvron, Hugo; Leavitt, Matthew L.; Morcos, Ari S.; Biroli, Giulio; Sagun, Levent

doi:10.1088/1742-5468/ac9830

Article · Journal of Statistical Mechanics: Theory and Experiment · Nov 1, 2022 · Green OA

Indexed in: arXiv, Crossref

Abstract

Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional…

Citation impact

Total citations: 711

FWCI: 68.10
Percentile: 100%
References: 79

Authors: 6

Topics & keywords

Keywords

Computer science
Inductive bias
Locality
Artificial intelligence
Convolutional neural network
Transformer
Pattern recognition (psychology)
Machine learning
