Sequence-Level Knowledge Distillation

Kim, Yoon; Rush, Alexander M.

doi:10.18653/v1/d16-1139

Article, Jan 1, 2016, Gold OA

Harvard University

Indexed in Crossref

Abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches that have proven successful for reducing the size of neural models in other domains to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher…
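
As a rough illustration of the word-level distillation mentioned in the abstract, the sketch below computes the cross-entropy between a teacher's and a student's per-token output distributions, averaged over positions. The function names, the toy logits, and the plain-Python formulation are assumptions made for illustration only; this is not the authors' implementation, and the sequence-level variants are not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def word_level_kd_loss(teacher_logits, student_logits):
    """Mean cross-entropy H(teacher, student) over target positions.

    Both arguments are lists of rows, one row of vocabulary logits per
    target position (hypothetical layout chosen for this sketch).
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        # Teacher probabilities act as soft targets for the student.
        q = [math.exp(v) for v in log_softmax(t_row)]
        log_p = log_softmax(s_row)
        total -= sum(qk * lpk for qk, lpk in zip(q, log_p))
    return total / len(teacher_logits)


# Toy example: two target positions, vocabulary of three words.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.3]]
student_close = [[1.8, 0.6, -0.9], [0.1, 1.4, 0.2]]
student_far = [[-1.0, 0.5, 2.0], [1.5, 0.0, 0.3]]

print(word_level_kd_loss(teacher, student_close))
print(word_level_kd_loss(teacher, student_far))
```

The loss is smallest when the student reproduces the teacher's distribution exactly, so the second printed value should be larger than the first. In practice such a term is typically combined with the usual negative log-likelihood on the reference translation.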

Citation impact

Total citations: 774

FWCI: 57.42
Percentile: 100%
References: 62

Topics & keywords

Keywords

Distillation
Pruning
Computer science
Beam search
Machine translation
Sequence (biology)
Artificial intelligence
Baseline (sea)

UN Sustainable Development Goals

Quality Education
