Keywords: knowledge distillation; large language models; residual learning; mixture-of-experts; cross-tokenizer knowledge distillation
TL;DR: The paper introduces a two-stage LLM distillation method built on a novel residual learning approach, which lets the student improve beyond imperfect teacher knowledge, with strong results even when teacher and student use different tokenizers.
Abstract: Knowledge distillation has become a crucial technique for transferring the capabilities of large language models (LLMs) to smaller, more efficient models for practical deployment. While recent work exploits rich information from intermediate states of the teacher model for more effective knowledge transfer, imperfect knowledge from the teacher can also mislead student learning, restricting the student’s generalization capacity. In this work, we propose a two-stage distillation framework that is effective across diverse knowledge distillation scenarios. In the first stage, we pretrain projectors to extract and compress teacher knowledge into a low-dimensional vector space via self-reconstruction. In the second stage, we perform distillation with a hybrid objective that combines learning from the compressed teacher representations with standard supervised fine-tuning on ground-truth data. Our key innovation is residual learning for LLM distillation, where the student learns to make predictions based on the differential between its own representations and the projected teacher states. This approach encourages the student to improve its representations beyond potentially erroneous teacher knowledge. For Mixture-of-Experts (MoE) teacher models, we further fuse the experts’ outputs with a self-attention mechanism to better utilize the teacher’s knowledge. Moreover, to support the cross-tokenizer distillation setting, where the teacher and student models have different vocabularies, we adopt a cross-model attention mechanism that eliminates the need for explicit token alignment rules. Experimental results show the superior performance of our proposed framework under both same- and cross-tokenizer settings, demonstrating its effectiveness in preserving teacher knowledge and improving student generalization capability.
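A minimal PyTorch-style sketch of the two-stage idea summarized in the abstract (module names, dimensions, and the exact form of the residual prediction head are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """Stage 1 (sketch): compress teacher hidden states into a low-dimensional
    space, pretrained with a self-reconstruction (autoencoder-style) loss."""

    def __init__(self, teacher_dim: int = 4096, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Linear(teacher_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, teacher_dim)

    def reconstruction_loss(self, h_teacher: torch.Tensor) -> torch.Tensor:
        z = self.encoder(h_teacher)
        return F.mse_loss(self.decoder(z), h_teacher)


class ResidualHead(nn.Module):
    """Stage 2 (sketch): the student predicts from the differential between its
    own representation and the projected teacher state, so it can correct,
    rather than merely copy, imperfect teacher knowledge."""

    def __init__(self, student_dim: int = 2048, latent_dim: int = 256, vocab: int = 32000):
        super().__init__()
        self.to_latent = nn.Linear(student_dim, latent_dim)
        self.lm_head = nn.Linear(latent_dim, vocab)

    def forward(self, h_student: torch.Tensor, z_teacher: torch.Tensor) -> torch.Tensor:
        residual = self.to_latent(h_student) - z_teacher.detach()
        return self.lm_head(residual)


def hybrid_loss(residual_logits, student_logits, labels, alpha: float = 0.5):
    """Hybrid objective (sketch): mix residual-based distillation with
    standard supervised fine-tuning on ground-truth tokens."""
    distill = F.cross_entropy(residual_logits.view(-1, residual_logits.size(-1)), labels.view(-1))
    sft = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * distill + (1.0 - alpha) * sft
```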
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 13988