Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Published: 01 Jan 2024 · Last Modified: 08 Oct 2025 · NLPCC (3) 2024 · CC BY-SA 4.0
Abstract: Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have focused primarily on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required to modify model structures. To address these issues, we introduce a novel method that integrates a sparse mixture-of-experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only strengthens the capabilities of the student model but also enables knowledge distillation with MoE students without requiring continued pretraining. Experimental results indicate that our approach improves model capabilities compared with distillation into a dense student, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprog-xhy/MoE-KD-release.git.
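The abstract describes augmenting a student model with LoRA-based sparse MoE experts and distilling from a teacher without continued pretraining. Below is a minimal, hedged sketch of what such a layer and a standard distillation loss might look like in PyTorch; the class and function names (LoRAMoELinear, kd_loss), the expert count, rank, top-k routing, and temperature are illustrative assumptions and are not taken from the paper or its released code.

```python
# Illustrative sketch only (not the authors' implementation): a frozen linear layer
# augmented with a sparse mixture of LoRA experts, plus a standard KL-based KD loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAMoELinear(nn.Module):
    """Frozen base linear layer plus a sparse mixture of low-rank (LoRA) experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=8, top_k=2):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # base weights stay frozen
        self.base.bias.requires_grad_(False)
        self.router = nn.Linear(d_in, num_experts)          # token-level gating
        # Each expert contributes a low-rank update: delta = x @ A_e @ B_e
        self.lora_A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.top_k = top_k

    def forward(self, x):                                    # x: (batch, seq, d_in)
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)             # (batch, seq, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)         # sparse top-k routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):
            a = self.lora_A[idx[..., k]]                     # (batch, seq, d_in, rank)
            b = self.lora_B[idx[..., k]]                     # (batch, seq, rank, d_out)
            delta = torch.einsum("bsd,bsdr,bsro->bso", x, a, b)
            out = out + weights[..., k : k + 1] * delta
        return out


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL divergence between teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```

In this sketch only the router and LoRA parameters are trainable, which is one plausible way to add MoE capacity to a student without retraining its base weights; the paper's actual layer placement, routing scheme, and loss weighting may differ.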