Keywords: knowledge distillation, autoregressive language models
Abstract: Knowledge distillation is a widely used technique for training large language models. It is applied both for strong-to-weak distillation, where large-scale flagship models serve as teachers to produce lightweight models suitable for deployment, and for weak-to-strong distillation, where previous-generation models contribute to building stronger next-generation models. From the model-compression perspective, knowledge distillation may encourage students to adopt mode-seeking behavior; however, building generalizable generative language models also calls for mode-covering behavior. To address this, we conduct an experimental analysis and propose a simple yet effective grafting strategy, in which sequence trees generated at multiple temperatures are combined into a single distillation target for autoregressive modeling. Our extensive experiments demonstrate the effectiveness of the proposed grafting approach.
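The abstract does not specify the grafting procedure, but the following minimal Python sketch illustrates one plausible reading: sequences are sampled from a teacher at several temperatures, merged into a single prefix tree, and each node's empirical next-token distribution serves as a distillation target. All names here (`toy_teacher_logits`, `graft_sequences`, `distillation_targets`) and the toy teacher are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of the multi-temperature "grafting" idea: sequence trees
# sampled at different temperatures are merged into one prefix tree whose
# per-node next-token distributions form a single distillation target.
# This is an assumption-based illustration, not the paper's actual method.
import math
import random
from collections import defaultdict

VOCAB = ["<eos>", "a", "b", "c"]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def toy_teacher_logits(prefix):
    # Deterministic toy logits so the example runs without a real language model.
    rng = random.Random(hash(tuple(prefix)))
    return [rng.uniform(-1.0, 1.0) for _ in VOCAB]


def sample_sequence(temperature, max_len=5):
    # Autoregressively sample one sequence from the toy teacher at a given temperature.
    prefix = []
    for _ in range(max_len):
        logits = toy_teacher_logits(prefix)
        probs = softmax([l / temperature for l in logits])
        tok = random.choices(range(len(VOCAB)), weights=probs)[0]
        if VOCAB[tok] == "<eos>":
            break
        prefix.append(tok)
    return prefix


def graft_sequences(temperatures, samples_per_temp=8):
    # Merge sequences drawn at different temperatures into a single prefix tree:
    # counts[prefix][next_token] accumulates children across all temperatures.
    counts = defaultdict(lambda: defaultdict(int))
    for t in temperatures:
        for _ in range(samples_per_temp):
            seq = sample_sequence(t)
            for i, tok in enumerate(seq):
                counts[tuple(seq[:i])][tok] += 1
    return counts


def distillation_targets(counts):
    # Each tree node yields an empirical next-token distribution for the student
    # to match (e.g., via cross-entropy against these soft targets).
    targets = {}
    for prefix, child_counts in counts.items():
        total = sum(child_counts.values())
        targets[prefix] = {VOCAB[t]: c / total for t, c in child_counts.items()}
    return targets


if __name__ == "__main__":
    tree = graft_sequences(temperatures=[0.7, 1.0, 1.3])
    for prefix, dist in list(distillation_targets(tree).items())[:3]:
        print([VOCAB[t] for t in prefix], dist)
```

Intuitively, lower-temperature samples emphasize mode-seeking (high-probability continuations) while higher-temperature samples add mode-covering diversity; grafting their sequence trees lets a single target carry both behaviors.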
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1738