DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Indexed in: arXiv, DataCite
Abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while…
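The abstract refers to knowledge distillation, in which a smaller student model is trained to reproduce the output distribution of a larger teacher. The block below is a minimal sketch of the generic soft-target distillation loss, assuming temperature-scaled softmax outputs; it is not taken from the paper and is not necessarily the exact objective DistilBERT uses. The function names, the temperature value, and the example logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution.

    Generic soft-target loss (an assumption for illustration, not the
    paper's stated training objective).
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
mismatched_student = [0.5, 1.0, 4.0]

loss_match = distillation_loss(matching_student, teacher)
loss_mismatch = distillation_loss(mismatched_student, teacher)
print(loss_match < loss_mismatch)
```

A student whose logits agree with the teacher's incurs a lower loss than one that ranks the tokens in the opposite order, which is the signal that pushes the student toward the teacher's behaviour during training.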
Citation impact
4,582 total citations
- FWCI: 344.58
- Percentile: 100%
- References: 18
Authors
4
Topics & keywords
Topics
Keywords
- Leverage (statistics)
- Computer science
- Inference
- Language understanding
- Distillation
- Language model
- Computation
- Task (project management)
UN Sustainable Development Goals
- Quality Education