Article · Jan 1, 2016 · Gold OA

Sequence-Level Knowledge Distillation

Harvard University

Indexed in Crossref

Abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches to NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher…

Citation impact

Total citations: 774
FWCI: 57.42
Percentile: 100%
References: 62

Authors: 2

Topics & keywords

Keywords
  • Distillation
  • Pruning
  • Computer science
  • Beam search
  • Machine translation
  • Sequence (biology)
  • Artificial intelligence
  • Baseline (sea)
UN Sustainable Development Goals
  • Quality Education