Article · Jan 1, 2020 · Gold OA

TinyBERT: Distilling BERT for Natural Language Understanding

Wuhan National Laboratory for Optoelectronics · Huazhong University of Science and Technology · +2 more institutions

Indexed in Crossref

Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for…

Citation impact

  • Total citations: 1,620
  • FWCI: 145.39
  • Percentile: 100%
  • References: 46

Authors

8

Topics & keywords

Keywords
  • Computer science
  • Transformer
  • Distillation
  • Inference
  • Benchmark (surveying)
  • Language model
  • Artificial intelligence
  • Natural language understanding
UN Sustainable Development Goals
  • Quality Education

Funding