Keywords: knowledge distillation, autoregressive language models
Abstract: Knowledge distillation is a widely used technique for training large language models. It is applied both for strong-to-weak distillation, where large-scale flagship models serve as teachers to produce lightweight models suitable for deployment, and for weak-to-strong distillation, where previous-generation models contribute to building stronger next-generation models. From the model-compression perspective, knowledge distillation may encourage students to adopt mode-seeking behavior; however, building generalizable generative language models also calls for mode-covering behavior. To address this, we conduct an experimental analysis and propose a simple yet effective grafting strategy, in which sequence trees generated at multiple temperatures are combined into a single distillation target for autoregressive modeling. Our extensive experiments demonstrate the effectiveness of the proposed grafting approach.
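The abstract does not specify the grafting procedure, but the following minimal Python sketch illustrates one plausible reading: sequences are sampled from a teacher at several temperatures, merged into a single prefix tree, and each node's empirical next-token distribution serves as a distillation target. All names here (`toy_teacher_logits`, `graft_sequences`, `distillation_targets`) and the toy teacher are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of the multi-temperature "grafting" idea: sequence trees
# sampled at different temperatures are merged into one prefix tree whose
# per-node next-token distributions form a single distillation target.
# This is an assumption-based illustration, not the paper's actual method.
import math
import random
from collections import defaultdict

VOCAB = ["<eos>", "a", "b", "c"]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def toy_teacher_logits(prefix):
    # Deterministic toy logits so the example runs without a real language model.
    rng = random.Random(hash(tuple(prefix)))
    return [rng.uniform(-1.0, 1.0) for _ in VOCAB]


def sample_sequence(temperature, max_len=5):
    # Autoregressively sample one sequence from the toy teacher at a given temperature.
    prefix = []
    for _ in range(max_len):
        logits = toy_teacher_logits(prefix)
        probs = softmax([l / temperature for l in logits])
        tok = random.choices(range(len(VOCAB)), weights=probs)[0]
        if VOCAB[tok] == "<eos>":
            break
        prefix.append(tok)
    return prefix


def graft_sequences(temperatures, samples_per_temp=8):
    # Merge sequences drawn at different temperatures into a single prefix tree:
    # counts[prefix][next_token] accumulates children across all temperatures.
    counts = defaultdict(lambda: defaultdict(int))
    for t in temperatures:
        for _ in range(samples_per_temp):
            seq = sample_sequence(t)
            for i, tok in enumerate(seq):
                counts[tuple(seq[:i])][tok] += 1
    return counts


def distillation_targets(counts):
    # Each tree node yields an empirical next-token distribution for the student
    # to match (e.g., via cross-entropy against these soft targets).
    targets = {}
    for prefix, child_counts in counts.items():
        total = sum(child_counts.values())
        targets[prefix] = {VOCAB[t]: c / total for t, c in child_counts.items()}
    return targets


if __name__ == "__main__":
    tree = graft_sequences(temperatures=[0.7, 1.0, 1.3])
    for prefix, dist in list(distillation_targets(tree).items())[:3]:
        print([VOCAB[t] for t in prefix], dist)
```

Intuitively, lower-temperature samples emphasize mode-seeking (high-probability continuations) while higher-temperature samples add mode-covering diversity; grafting their sequence trees lets a single target carry both behaviors.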
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1738