AutoDisc: Automatic Distillation Schedule for Large Language Model Compression

22 Sept 2022 (modified: 13 Feb 2023) | ICLR 2023 Conference Withdrawn Submission | Readers: Everyone
Abstract: Driven by the teacher-student paradigm, knowledge distillation is one of the de facto ways to compress language models. Recent studies have uncovered that conventional distillation is less effective when facing a large capacity gap between the teacher and the student, and have introduced teacher assistant-based distillation to bridge the gap. As the connection between the two, the scale and the performance of the teacher assistant are crucial for transferring knowledge from the teacher to the student. However, existing teacher assistant-based methods select the teacher assistant manually, requiring many trials before identifying the optimal one. To this end, we propose an Automatic Distillation Schedule (AutoDisc) for large language model compression that discovers the optimal teacher assistant in only one trial. In particular, motivated by the finding that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant, AutoDisc designs a $\lambda$-Tradeoff to measure the optimality of the teacher assistant. AutoDisc then yields the $\lambda$-Tradeoffs of all teacher assistant candidates in a once-for-all optimization with two approximations. The optimal teacher assistant can be selected automatically by uncovering the best $\lambda$-Tradeoff. AutoDisc is evaluated with an extensive set of experiments on the GLUE language understanding benchmark. Experimental results demonstrate that AutoDisc achieves improved efficiency with similar or even better effectiveness compared to several state-of-the-art baselines. We further apply AutoDisc to a language model with over one billion parameters to show its scalability.
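
For intuition only, the sketch below illustrates how teacher-assistant candidates might be ranked by a scale-performance tradeoff and the best one selected. The abstract does not give the actual $\lambda$-Tradeoff formula or the once-for-all optimization, so the linear scoring function, the Candidate fields, and the value of lambda here are illustrative assumptions rather than the authors' method.

```python
# Hypothetical sketch: rank teacher-assistant (TA) candidates by a
# scale-performance tradeoff and pick the best one. The scoring rule
# below is an assumption; the paper's lambda-Tradeoff is not specified
# in the abstract.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # e.g. "6-layer TA" (hypothetical label)
    num_params: float    # parameter count of the TA candidate
    performance: float   # validation metric (e.g. accuracy) after distillation


def lambda_tradeoff(c: Candidate, teacher_params: float, lam: float) -> float:
    """Assumed scoring: reward task performance, penalize retained scale."""
    return c.performance - lam * (c.num_params / teacher_params)


def select_teacher_assistant(candidates, teacher_params, lam=0.1):
    """Return the candidate with the highest (assumed) lambda-tradeoff score."""
    return max(candidates, key=lambda c: lambda_tradeoff(c, teacher_params, lam))


# Example usage with made-up numbers.
cands = [
    Candidate("12-layer", 110e6, 0.88),
    Candidate("6-layer", 66e6, 0.86),
    Candidate("4-layer", 52e6, 0.83),
]
best = select_teacher_assistant(cands, teacher_params=110e6)
print(best.name)
```

In this toy setup, a larger lambda favors smaller teacher assistants while a smaller lambda favors better-performing ones; AutoDisc's contribution, per the abstract, is obtaining all candidates' tradeoff scores in a single once-for-all optimization rather than training each candidate separately.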
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)