Decoupled Knowledge Distillation
Megvii (China) · Vi Technology (United States) · +2 more institutions
Abstract
State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint to study logit distillation, we re-formulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the “difficulty” of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of…
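The re-formulation described in the abstract can be illustrated with a short numerical sketch. The excerpt above names TCKD and NCKD but does not give their formulas, so the definitions below are assumptions: TCKD is taken as the KL divergence between the binary (target vs. non-target) distributions, and NCKD as the KL divergence between the non-target probabilities renormalized to sum to 1. Under these assumptions the classical KD loss equals TCKD plus NCKD weighted by the teacher's non-target mass, which is one way to read the "coupled formulation" the abstract mentions. The function names and the `temperature` parameter are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decompose_kd(teacher_logits, student_logits, target, temperature=1.0):
    """Split the classical KD loss into assumed TCKD and NCKD terms.

    Returns (kd, tckd, nckd, weight), where weight is the teacher's total
    probability on non-target classes. These definitions are an assumption
    for illustration, not taken from the excerpt above.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Classical KD loss: KL between the full class distributions.
    kd = kl_divergence(p_teacher, p_student)

    # TCKD (assumed): KL between binary target / non-target distributions.
    rest_teacher = 1.0 - p_teacher[target]
    rest_student = 1.0 - p_student[target]
    tckd = kl_divergence(
        [p_teacher[target], rest_teacher],
        [p_student[target], rest_student],
    )

    # NCKD (assumed): KL among non-target classes, renormalized to sum to 1.
    hat_teacher = [p / rest_teacher for i, p in enumerate(p_teacher) if i != target]
    hat_student = [p / rest_student for i, p in enumerate(p_student) if i != target]
    nckd = kl_divergence(hat_teacher, hat_student)

    return kd, tckd, nckd, rest_teacher


if __name__ == "__main__":
    kd, tckd, nckd, weight = decompose_kd(
        [2.0, 1.0, 0.1], [1.5, 0.3, 0.8], target=0
    )
    # The two sides should agree up to floating-point error.
    print(kd, tckd + weight * nckd)
```

In this sketch the NCKD term is scaled by the teacher's non-target probability, so it shrinks when the teacher is very confident about the target class. A decoupled variant would presumably weight the two terms independently, but the excerpt is truncated before it describes that step.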
Citation impact
- FWCI: 46.32
- Percentile: 100%
- References: 56
Authors
- 5
Topics & keywords
- Distillation
- Computer science
- Flexibility (engineering)
- Class (philosophy)
- Artificial intelligence
- Feature (linguistics)
- Machine learning
- Image (mathematics)