Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Distillation, AI-Feedback, large language model
Abstract: Distillation of large language models (LLMs) has traditionally focused on transferring teacher responses, often assuming access to internal logits. In modern LLM deployment, however, the teacher is typically accessible only as a black-box API or is too large to support online distillation, while simultaneously possessing strong evaluative capabilities that remain underexploited. As a result, students learn what to answer, but not which answers are preferable. This gap limits generalization, propagates teacher errors, and prevents students from improving beyond imitation. We therefore propose a unified distillation framework that transfers both responses and evaluation ability. Our key idea is to distill reward signals from the teacher, eliminating the need for costly human annotations. However, extracting reliable reward signals from LLMs is challenging because they are optimized for generation rather than evaluation. To address this, we introduce an adaptive reward distillation strategy that applies majority voting for verifiable tasks and LLM-as-Judge scoring for open-ended tasks, yielding noisy yet effective self-supervised signals. To mitigate distribution shift, we systematically collect and label both teacher- and student-generated responses and use them to train a reward model. The student is first warmed up with supervised fine-tuning on high-quality teacher responses, then refined with reinforcement learning guided by the learned reward model. Experiments on GSM8K, GSM-Plus, MMLU-Pro, and AlpacaEval2 demonstrate consistent gains over supervised fine-tuning, with smaller students in some cases even surpassing their teachers. These results highlight our method as a scalable and effective paradigm for training efficient yet competitive LLMs.
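The adaptive reward distillation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the binary majority-vote reward, and the `judge` callable are illustrative assumptions. For verifiable tasks (e.g., GSM8K-style math), a candidate answer is rewarded when it agrees with the majority answer among teacher samples (self-consistency style); for open-ended tasks, a teacher LLM acting as judge assigns the score.

```python
from collections import Counter

def majority_vote_reward(candidate_answer, sampled_answers):
    """Verifiable tasks: reward 1.0 if the candidate matches the
    majority answer among teacher samples, else 0.0. (Illustrative;
    the binary reward is an assumption, not the paper's exact scheme.)"""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if candidate_answer == majority else 0.0

def judge_reward(prompt, response, judge):
    """Open-ended tasks: score from a teacher LLM acting as judge.
    `judge` is a hypothetical stand-in callable for the judge API."""
    return judge(prompt, response)

def adaptive_reward(task_is_verifiable, **kwargs):
    """Route each example to the appropriate reward source; these
    labels can then supervise a reward model for the RL stage."""
    if task_is_verifiable:
        return majority_vote_reward(kwargs["candidate_answer"],
                                    kwargs["sampled_answers"])
    return judge_reward(kwargs["prompt"], kwargs["response"], kwargs["judge"])
```

Either branch yields a scalar label without human annotation, matching the abstract's claim that noisy teacher-derived signals can replace human preference data.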
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12493