Keywords: knowledge distillation, Large Language Models
Abstract: Knowledge Distillation (KD) is a key technique for enhancing the capabilities of student models by transferring knowledge from powerful teachers. In Large Language Models (LLMs), however, the effectiveness of this transfer is fundamentally limited by distributional mismatch: the generic data used for distillation often fails to reflect the specialized distribution underpinning the teacher's core expertise. This gap hinders the acquisition of the teacher's most valuable capabilities. The challenge is fundamental because the ideal corrective method, importance weighting, is intractable without access to the unknown target density.
We propose Discrepancy-Aware Knowledge Distillation (DAKD), a framework that reframes this problem. Instead of estimating the unknown distribution, DAKD approximates the ideal importance weights by measuring the predictive discrepancy between the full teacher and a pre-trained-only base teacher, which serves as a distributional probe. The DAKD framework is "discrepancy aware" in a dual sense: it leverages the teacher-base divergence for distributional correction while using the teacher-student divergence for adaptive learning focus. This reweighting is applied across multiple granularities, from the sequence and position levels down to the vocabulary level. Extensive experiments show that DAKD substantially outperforms state-of-the-art methods, enabling student models to more effectively inherit the nuanced capabilities of more powerful teachers.
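The abstract's core idea can be illustrated with a minimal sketch at the position level: use the teacher-base divergence at each position as a proxy importance weight, then apply those weights to the per-position teacher-student KD objective. This is only one plausible instantiation under stated assumptions, not the paper's actual formulation; the function names (`dakd_position_loss`) and the choice of KL divergence and sequence-level normalization are illustrative.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, axis=-1):
    # KL(p || q) over the vocabulary axis; epsilon guards against log(0)
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=axis)

def dakd_position_loss(teacher_logits, base_logits, student_logits):
    """Hypothetical position-level discrepancy-aware KD loss.

    All logits have shape (seq_len, vocab_size).
    Assumption: the teacher-base divergence stands in for the
    intractable importance weights mentioned in the abstract.
    """
    p_t = softmax(teacher_logits)
    p_b = softmax(base_logits)
    p_s = softmax(student_logits)
    # Teacher-base divergence -> proxy importance weight per position
    w = kl(p_t, p_b)                   # shape (seq_len,)
    w = w / (w.sum() + 1e-12)          # normalize over the sequence
    # Teacher-student divergence -> the weighted KD objective
    return float(np.sum(w * kl(p_t, p_s)))
```

Positions where the full teacher diverges most from its pre-trained-only base (i.e., where fine-tuned expertise concentrates) receive the largest weights, so the student's learning is focused there.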
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11759