Training data-efficient image transformers & distillation through attention
Indexed in arXiv
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the…
Citation impact
1,049 total citations
- FWCI: 59.76
- Percentile: 100%
- References: 61
Authors
6
Topics & keywords
Keywords
- Transformer
- Computer science
- Distillation
- Limiting
- Artificial intelligence
- Artificial neural network
- Machine learning
- Security token
UN Sustainable Development Goals
- Industry, innovation and infrastructure