TinyBERT: Distilling BERT for Natural Language Understanding
Wuhan National Laboratory for Optoelectronics · Huazhong University of Science and Technology · +2 more institutions
Abstract
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method designed specifically for knowledge distillation (KD) of Transformer-based models. With this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for…
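As a rough illustration of the teacher-to-student transfer the abstract mentions, the sketch below shows a generic soft-label knowledge distillation loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. This is the classic logit-matching form of KD, not the Transformer distillation method the abstract proposes, whose details are not given here. The function names, the example logits, and the temperature value are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    Hypothetical helper: generic soft-label KD, not the paper's method.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Illustrative logits (made up): a student that agrees with the teacher
# incurs a lower loss than one that ranks the classes in reverse.
teacher = [2.0, 1.0, 0.1]
agreeing_student = [2.0, 1.0, 0.1]
disagreeing_student = [0.1, 1.0, 2.0]
print(soft_label_kd_loss(teacher, agreeing_student))
print(soft_label_kd_loss(teacher, disagreeing_student))
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to incorrect classes. In practice this term is usually combined with a standard supervised loss on the true labels.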
Citation impact
- FWCI: 145.39
- Percentile: 100%
- References: 46
Authors: 8
Topics & keywords
- Computer science
- Transformer
- Distillation
- Inference
- Benchmark (surveying)
- Language model
- Artificial intelligence
- Natural language understanding
- Quality Education