Lifting the Curse of Capacity Gap in Distilling Large Language Models

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Abstract: Large language models (LLMs) have shown compelling performance on various downstream tasks, but unfortunately require a tremendous amount of inference compute. Knowledge distillation offers a path to compressing LLMs into small ones via a teacher-student paradigm. However, when the capacity gap between the teacher and the student is large, a curse of capacity gap appears, causing a deficiency in distilling LLMs. While a few studies have investigated ways to bridge the gap, the curse is not yet well tackled. To this end, we aim to lift the curse of capacity gap by enlarging the capacity of the student without notably increasing the inference compute. Largely motivated by the sparse activation regime of mixture of experts (MoE), we propose a mixture of minimal experts (MiniMoE), which adds extra parameters to the student but introduces almost no additional inference compute. Experimental results on GLUE and CoNLL demonstrate that MiniMoE lifts the curse of capacity gap to a large extent. MiniMoE also achieves state-of-the-art performance at small FLOPs compared with a range of competitive baselines. With compression as aggressive as ~50x, MiniMoE preserves 95% of the teacher's GLUE score.
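The abstract does not specify the MiniMoE architecture beyond the idea of sparse activation: many small experts add parameters, but only one expert runs per token, so inference compute stays close to that of a single small feed-forward block. The following is a minimal sketch of that idea, assuming a standard top-1 routed MoE feed-forward layer; the names (MiniMoELayer, expert_hidden, num_experts) and the top-1 routing choice are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' implementation) of a sparsely
# activated mixture-of-experts feed-forward layer with top-1 routing.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MiniMoELayer(nn.Module):
    def __init__(self, d_model: int, expert_hidden: int, num_experts: int):
        super().__init__()
        # Each expert is a small feed-forward block; many experts add parameters,
        # but only one expert is evaluated per token, so per-token compute stays
        # roughly that of a single small expert.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, expert_hidden),
                nn.GELU(),
                nn.Linear(expert_hidden, d_model),
            )
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to a stream of tokens
        tokens = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.router(tokens), dim=-1)   # (num_tokens, num_experts)
        expert_idx = gate_probs.argmax(dim=-1)                # top-1 routing: one expert per token

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the gate probability so the router remains trainable.
                out[mask] = expert(tokens[mask]) * gate_probs[mask, e].unsqueeze(-1)
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = MiniMoELayer(d_model=256, expert_hidden=64, num_experts=8)
    y = layer(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])
```

With hypothetical sizes like the ones above, the layer holds num_experts small expert blocks' worth of parameters while each token pays the FLOPs of only one expert plus the router, which is the sense in which capacity grows with almost no additional inference compute.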
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)