MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wang, Wenhui; Wei, Furu; Dong, Li; Bao, Hangbo; Yang, Nan; Zhou, Ming

doi:10.48550/arxiv.2002.10957

Preprint · arXiv (Cornell University) · Feb 25, 2020 · Green OA

Indexed in: arXiv, DataCite

Abstract

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which makes fine-tuning and online serving difficult in real-life applications because of latency and capacity constraints. In this work, we present a simple and effective approach to compressing large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module of the large model (teacher), which plays a vital role in Transformer networks. Specifically, we propose…
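
The abstract is truncated here, so the exact training objective is not stated in this record. As a rough illustration of what "mimicking the self-attention module" can mean, the sketch below compares a teacher's and a student's attention distributions with a KL divergence. The KL choice, the single-head setup, the toy inputs, and all function names (`attention_distribution`, `attention_kl`) are assumptions made for illustration, not the authors' stated method.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_distribution(queries, keys):
    # Single-head scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).
    # The result is seq_len x seq_len, independent of the hidden size d.
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def attention_kl(teacher_attn, student_attn, eps=1e-12):
    # Mean KL(teacher || student) over query positions (assumed loss, see lead-in).
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row))
    return total / len(teacher_attn)

# Toy inputs (hypothetical): 3 tokens, teacher hidden size 4, student hidden size 2.
teacher_qk = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [0.5, 0.5, 0.0, 0.0]]
student_qk = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

teacher_attn = attention_distribution(teacher_qk, teacher_qk)
student_attn = attention_distribution(student_qk, student_qk)
loss = attention_kl(teacher_attn, student_attn)
print(f"attention KL loss: {loss:.4f}")
```

Because attention weights form a sequence-by-sequence matrix, the teacher and student in this sketch can use different hidden sizes and still be compared directly; in practice the loss would be minimized with respect to the student's parameters.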

Citation impact

Total citations: 635

References: 57

Authors: 6

Topics & keywords

Keywords

Transformer
Computer science
Distillation
Task (project management)
Artificial intelligence
Machine learning
Engineering
Chemistry
