Weak-to-strong Generalization via Formative Learning from Student Demonstrations & Teacher Evaluation
Abstract: As Large Language Models (LLMs) exceed human capabilities, providing reliable human feedback for evaluating and aligning them via standard frameworks such as Reinforcement Learning from Human Feedback becomes challenging. This raises a fundamental question: \textit{how can we leverage weaker (teacher) supervision to elicit the full capabilities of a stronger (student) model?} This emerging paradigm, known as Weak-to-Strong (W2S) generalization, however, introduces a key challenge: the strong student may ``overfit'' to the weak teacher's mistakes, resulting in a notable performance degradation compared to learning from ground-truth data. We show that this overfitting occurs because learning with weak supervision implicitly regularizes the strong student's policy toward the weak teacher's reference policy. Building on this insight, we propose a novel learning approach, called Weak Teacher \textbf{E}\textbf{v}aluation of Strong Student D\textbf{e}monstrations, or \textsc{Eve}, which instead regularizes the strong student toward its own reference policy. \textsc{Eve}'s regularization elicits the strong student's knowledge through its own task demonstrations while relying on the weaker teacher to evaluate these demonstrations -- an instance of formative learning. Extensive empirical evaluations demonstrate that \textsc{Eve} significantly outperforms existing W2S learning approaches and is substantially more robust under unreliable feedback than contrastive learning methods such as Direct Preference Optimization.
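To make the regularization contrast concrete, here is a minimal formalization under assumed notation (the symbols $\pi_w$, $\pi_\theta$, $\pi_{\mathrm{ref}}$, $r_w$, and $\beta$ are illustrative and not taken from the paper): standard W2S fine-tuning on weak labels maximizes $\mathbb{E}_{x,\, y \sim \pi_w}\!\left[\log \pi_\theta(y \mid x)\right]$, i.e., it minimizes a cross-entropy to the weak teacher's policy $\pi_w$ and thus implicitly pulls the student $\pi_\theta$ toward $\pi_w$. A formative alternative of the kind sketched in the abstract instead samples demonstrations from the student itself, has the weak teacher score them with $r_w(x, y)$, and optimizes a KL-regularized objective of the form
\[
\max_\theta \; \mathbb{E}_{x,\, y \sim \pi_\theta}\!\left[r_w(x, y)\right] \;-\; \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\]
where $\pi_{\mathrm{ref}}$ is the strong student's own reference policy rather than the weak teacher's.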
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Christopher_Mutschler1
Submission Number: 7022