TL;DR: We study teacher hacking: does over-optimizing the distillation objective harm ground-truth performance?
Abstract: Post-training of language models (LMs) increasingly relies on two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second stage, a well-known challenge is reward hacking: the LM over-optimizes the reward model, leading to degraded performance on the true objective, in line with Goodhart's law.
In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher.
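To make the setup concrete, here is a minimal sketch of the distillation objective applied twice in the controlled pipeline (oracle → teacher, then teacher → student). It assumes a PyTorch implementation with a token-level forward KL loss; the paper's exact objective and models may differ.

```python
# Minimal sketch of the distillation loss (assumptions: PyTorch, token-level
# forward KL from the teacher to the student; the paper's exact loss may differ).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), averaged over all tokens in the batch."""
    vocab = student_logits.size(-1)
    # Flatten to (num_tokens, vocab) so that "batchmean" averages per token.
    student_logp = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    teacher_logp = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # F.kl_div takes the approximating distribution (student) as `input`
    # and the target distribution (teacher) as `target`.
    return F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
```

In this setup, the same loss distills the teacher from the oracle and the student from the teacher; teacher hacking is then the situation where the student keeps improving against the teacher while its performance relative to the oracle (the ground truth) degrades.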
Our experiments reveal the following insights. When distillation uses a fixed offline dataset, teacher hacking occurs; moreover, it can be detected by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing teacher hacking.
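The detection criterion above can be phrased as a simple goodness-of-fit check: fit a power law to the early part of a training curve in log-log space and flag later points that stray from the extrapolation. The sketch below is illustrative only; the tracked metric, the fit window (`fit_frac`), and the tolerance (`tol`) are assumptions, not the paper's exact procedure.

```python
# Illustrative check for deviation from a polynomial (power-law) convergence
# trend d(t) ~ C * t^(-alpha). All names and thresholds here are hypothetical;
# `steps` is assumed to start at 1 so that log(steps) is well defined.
import numpy as np

def deviates_from_power_law(steps, distances, fit_frac=0.5, tol=0.2):
    """Return True if late-training points stray from the extrapolated power law."""
    steps = np.asarray(steps, dtype=float)
    distances = np.asarray(distances, dtype=float)
    n_fit = max(2, int(len(steps) * fit_frac))
    # Fit log d = intercept + slope * log t on the early portion of training.
    slope, intercept = np.polyfit(np.log(steps[:n_fit]), np.log(distances[:n_fit]), deg=1)
    predicted = np.exp(intercept + slope * np.log(steps))
    # Relative gap between observed values and the extrapolated trend.
    rel_gap = np.abs(distances[n_fit:] - predicted[n_fit:]) / predicted[n_fit:]
    return bool(np.any(rel_gap > tol))
```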
Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust LMs.
Lay Summary: Large language models are often “distilled” into smaller “student” models by having the student imitate a larger “teacher” model. This process can backfire: the student sometimes learns to exploit quirks in the teacher rather than truly learning how language works, a hidden flaw we call teacher hacking.
In our experiments, we show that if the student is trained on the same batch of teacher-generated examples over and over, it learns shortcuts instead of real language skills. If fresh examples are generated throughout training, or a wide range of different prompts is mixed in, the student stays on track and avoids exploiting the teacher’s flaws.
Our findings suggest simple, practical fixes, namely online data sampling or richer, more diverse offline datasets, that make distilled models both more robust and more faithful to real language, improving their safety and usefulness in everyday applications.
Primary Area: Deep Learning->Large Language Models
Keywords: knowledge distillation, teacher hacking, language model distillation
Submission Number: 11024