ConViT: improving vision transformers with soft convolutional inductive biases
Indexed in: arXiv, Crossref
Abstract
Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional…
Citation impact
- Total citations: 711
- FWCI: 68.10
- Percentile: 100%
- References: 79
Authors: 6
Topics & keywords
Topics
Keywords
- Computer science
- Inductive bias
- Locality
- Artificial intelligence
- Convolutional neural network
- Transformer
- Pattern recognition (psychology)
- Machine learning