Weak-to-Strong Generalization via Formative Learning from Student Demonstrations & Teacher Evaluation

Published: 10 Jun 2025 · Last Modified: 13 Jul 2025 · DIG-BUG Long · CC BY 4.0
Keywords: Weak-to-Strong generalization, Superalignment, Reinforcement Learning From Human Feedback, LLMs
Abstract: As Large Language Models (LLMs) exceed human capabilities, providing reliable human feedback for evaluating and aligning them via standard frameworks such as Reinforcement Learning from Human Feedback becomes challenging. This raises a fundamental question: how can we leverage weaker (teacher) supervision to elicit the full capabilities of a stronger (student) model? This emerging paradigm, known as Weak-to-Strong (W2S) generalization, introduces a key challenge, however: the strong student may “overfit” to the weak teacher’s mistakes, resulting in a notable performance degradation compared to learning with ground-truth data. We show that this overfitting occurs because learning with weak supervision implicitly regularizes the strong student’s policy toward the weak reference policy. Building on this insight, we propose a novel learning approach, Weak Teacher Evaluation of Strong Student Demonstrations (EVE), which instead regularizes the strong student toward its own reference policy. EVE’s regularization elicits the strong student’s knowledge through its own task demonstrations while relying on the weaker teacher to evaluate these demonstrations, an instance of formative learning. Extensive empirical evaluations demonstrate that EVE significantly outperforms existing W2S learning approaches and is markedly more robust under unreliable feedback than naive SFT and refinement approaches.
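A rough sketch of the contrast the abstract describes, written in the standard KL-regularized RLHF form (the notation and exact objective here are our assumption, not taken from the paper): naive weak-to-strong fine-tuning on weak labels behaves as if the student were solving

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_w(x, y)\big] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{w}(\cdot \mid x)\big),$$

i.e., with the weak teacher's policy $\pi_w$ as the implicit reference, which is how the student inherits the teacher's mistakes. EVE would instead keep the strong student's own pre-fine-tuning policy $\pi_s$ as the reference,

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_w(x, y)\big] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{s}(\cdot \mid x)\big),$$

where the evaluation signal $r_w$ comes from the weak teacher scoring demonstrations $y$ sampled from the strong student itself. This is only an illustrative reading of the abstract; the paper's actual objective may differ.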
Submission Number: 55