MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
Indexed in: arXiv, DataCite
Abstract
Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which makes fine-tuning and online serving challenging in real-life applications because of latency and capacity constraints. In this work, we present a simple and effective approach to compressing large pre-trained models based on the Transformer (Vaswani et al., 2017), termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module of the large model (teacher), which plays a vital role in Transformer networks. Specifically, we propose…
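The abstract is cut off before it states the exact training objective, so the following is only an illustrative sketch of what "mimicking the self-attention module" could look like, not the authors' implementation. It assumes the student is penalized by the KL divergence between the teacher's and the student's attention distributions for a single head; the function names (`attention_distribution`, `attention_kl_loss`) and the toy inputs are hypothetical. One point it shows: attention distributions have shape sequence length by sequence length, so a teacher and a student with different hidden sizes can still be compared directly.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(queries, keys):
    """Row-wise softmax of scaled dot products, softmax(Q K^T / sqrt(d)).

    queries, keys: lists of vectors (one per token) of equal dimension d.
    Returns a seq_len x seq_len matrix whose rows each sum to 1.
    """
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def attention_kl_loss(teacher_attn, student_attn, eps=1e-12):
    """Mean over query positions of KL(teacher || student).

    Assumption: KL divergence is the distance used; the truncated
    abstract does not confirm this choice.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(
            t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_attn)


# Toy example: 3 tokens, teacher head dimension 4, student head dimension 2.
teacher_q = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.4, 0.9, 0.7]]
teacher_k = [[0.2, 0.1, 0.0, 0.5], [0.9, 0.3, 0.4, 0.1], [0.0, 0.6, 0.2, 0.8]]
student_q = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
student_k = [[0.4, 0.0], [0.1, 0.6], [0.8, 0.2]]

teacher_attn = attention_distribution(teacher_q, teacher_k)
student_attn = attention_distribution(student_q, student_k)
loss = attention_kl_loss(teacher_attn, student_attn)
print(f"attention distillation loss: {loss:.6f}")
```

In a real training loop this quantity would be computed with an autodiff framework and minimized with respect to the student's parameters; the pure-Python version here only makes the shape argument and the loss definition concrete.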
Citation impact
- Total citations: 635
- FWCI: n/a
- Percentile: n/a
- References: 57
Authors: 6
Topics & keywords
Keywords
- Transformer
- Computer science
- Distillation
- Task (project management)
- Artificial intelligence
- Machine learning
- Engineering
- Chemistry