TinyBERT: Distilling BERT for Natural Language Understanding

Jiao, Xiaoqi; Yin, Yichun; Shang, Lifeng; Jiang, Xin; Chen, Xiao Dong; Li, Linlin; Wang, Fang; Liu, Qun

doi:10.18653/v1/2020.findings-emnlp.372

Article · Jan 1, 2020 · Gold OA



Wuhan National Laboratory for Optoelectronics · Huazhong University of Science and Technology · +2 more institutions

Indexed in Crossref

Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for…
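The teacher-to-student transfer mentioned in the abstract can be illustrated with a generic soft-target distillation loss. The sketch below is only an assumption-laden illustration of knowledge distillation in general, not the paper's Transformer-specific method, which the truncated abstract does not spell out. The function names (`softmax`, `log_softmax`, `soft_cross_entropy`), the `temperature` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax using the log-sum-exp trick."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def soft_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft targets and the student's
    predicted distribution. Hypothetical helper for illustration only."""
    p_teacher = softmax(teacher_logits, temperature)
    log_q_student = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(p_teacher, log_q_student))


# Made-up logits for a 3-class task (illustrative values, not from the paper).
teacher = [2.0, 0.5, -1.0]
student_close = [1.8, 0.6, -0.9]
student_far = [-1.0, 0.5, 2.0]

loss_close = soft_cross_entropy(teacher, student_close, temperature=2.0)
loss_far = soft_cross_entropy(teacher, student_far, temperature=2.0)
```

Under these assumptions, a student whose logits track the teacher's receives a lower loss than one that disagrees, which is the signal a distillation objective of this kind would minimize during training.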

Citation impact

Total citations: 1,620
FWCI: 145.39
Percentile: 100%
References: 46

Authors: 8

Topics & keywords

Topics

Keywords

Computer science
Transformer
Distillation
Inference
Benchmark (surveying)
Language model
Artificial intelligence
Natural language understanding

UN Sustainable Development Goals

Quality Education
