Preprint | arXiv (Cornell University) | Feb 25, 2020 | Green OA

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Indexed in: arXiv, DataCite

Abstract

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which poses challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer-based (Vaswani et al., 2017) pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module of the large model (teacher), which plays a vital role in Transformer networks. Specifically, we propose…
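As a rough illustration of what "mimicking the self-attention module" can look like, the sketch below compares the attention distributions of a teacher and a student for a single head. The abstract is truncated before it states the exact objective, so the KL-divergence loss used here is an assumption, and the function names (`attention_distributions`, `attention_kl_loss`) and the toy vectors are hypothetical. It is a minimal sketch, not the paper's implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distributions(queries, keys):
    """Scaled dot-product attention weights for one head.

    Returns one probability distribution over key positions per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def attention_kl_loss(teacher_attn, student_attn, eps=1e-12):
    """Mean KL(teacher || student) over query positions.

    ASSUMPTION: KL divergence is used here only as a plausible way to
    make the student's attention mimic the teacher's.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(
            t * math.log((t + eps) / (s + eps)) for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_attn)


# Hypothetical toy inputs: 3 tokens, teacher head size 4, student head size 2.
teacher_q = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.1], [0.5, 0.5, 0.0, 1.0]]
teacher_k = [[0.9, 0.1, 0.4, 0.0], [0.2, 0.8, 0.1, 0.3], [0.1, 0.2, 0.7, 0.9]]
student_q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
student_k = [[0.5, 0.1], [0.1, 0.5], [0.3, 0.3]]

teacher_attn = attention_distributions(teacher_q, teacher_k)
student_attn = attention_distributions(student_q, student_k)
loss = attention_kl_loss(teacher_attn, student_attn)
print(f"attention distillation loss: {loss:.6f}")
```

Because each attention matrix has shape (sequence length × sequence length), the teacher and student can be compared directly even when their hidden sizes differ, as in the toy inputs above.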

Citation impact

Total citations: 635
References: 57

Authors

6

Topics & keywords

Keywords
  • Transformer
  • Computer science
  • Distillation
  • Task (project management)
  • Artificial intelligence
  • Machine learning
  • Engineering
  • Chemistry