Towards Automated Distillation: A Systematic Study of Knowledge Distillation in Natural Language Processing

25 Feb 2022 (modified: 05 May 2023) · AutoML 2022 (Late-Breaking Workshop) · Readers: Everyone
Abstract: The key factors underpinning optimal Knowledge Distillation (KD) performance remain elusive, as their effects are often confounded within sophisticated distillation algorithms. This poses a challenge for choosing the best distillation algorithm from the large design space, for existing and new tasks alike, and it hinders automated distillation. In this work, we aim to identify how distillation performance across different tasks is affected by the components of the KD pipeline, such as the data augmentation policy, the loss function, and the intermediate knowledge transfer between the teacher and the student. To isolate their effects, we propose Distiller, a meta-KD framework that systematically combines key distillation techniques as components across the stages of the KD pipeline. Distiller enables us to quantify each component's contribution and to conduct experimental studies that yield insights about distillation performance: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) the best-performing distillation algorithms differ considerably across tasks, and 3) data augmentation provides a large boost for small training datasets or small student networks. Based on these insights, we propose a simple AutoDistiller algorithm that recommends a close-to-optimal KD pipeline for a new dataset or task. This is a first step toward automated KD that can save engineering costs and democratize practical KD applications.
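To make the pipeline components concrete, below is a minimal sketch (not the paper's released code) of a combined KD objective of the kind the abstract describes: a hard-label cross-entropy term, a temperature-scaled soft-target term, and an MSE term over intermediate representations, the component the study identifies as most important. All names and hyperparameters here (distillation_loss, temperature, alpha, beta) are illustrative assumptions, not Distiller's actual API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5, beta=1.0):
    """Combine hard-label CE, soft-target KL, and intermediate-layer MSE.

    Assumes student_hidden and teacher_hidden already have matching shapes
    (in practice a learned projection usually aligns their dimensions).
    """
    # Hard-label cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft targets from the teacher, with the usual T^2 scaling so the
    # gradient magnitude is comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Intermediate knowledge transfer: match hidden representations.
    mse = F.mse_loss(student_hidden, teacher_hidden)
    return (1 - alpha) * ce + alpha * kl + beta * mse
```

In a Distiller-style study, choices such as the weighting scheme (alpha, beta), the loss family for each term, and which intermediate layers are matched would be the searchable components of the pipeline, which is what allows each component's contribution to be quantified in isolation.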
Keywords: Meta Learning, Knowledge Distillation, AutoML
One-sentence Summary: A meta-learning framework for knowledge distillation in natural language processing and an experimental study toward an automated knowledge distillation framework.
Track: Main track
Reproducibility Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Reviewers: Haoyu He, he.haoy@northeastern.edu
CPU Hours: 0
GPU Hours: 0
TPU Hours: 0
Datasets And Benchmarks: GLUE, SQuAD
Performance Metrics: MSE, CE, Accuracy
Main Paper And Supplementary Material: pdf