Provable Weak-to-Strong Generalization via Overspecified Students and Underspecified Teachers

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: weak-to-strong generalization, optimization, knowledge distillation
TL;DR: This paper provides a theoretical characterization of weak-to-strong generalization for underspecified teachers and overspecified students.
Abstract: Weak-to-strong generalization, as introduced in Burns et al. (2023), describes the phenomenon that a strong student (e.g., GPT-4) trained solely on labels generated by a weaker teacher (e.g., GPT-2) can outperform that teacher. In this work, we study the mechanism underlying weak-to-strong generalization in a controlled setting based on random feature models, specifically two-layer neural networks with random features and trainable linear output layers. We consider a regime where the teacher is \emph{underspecified} and cannot recover the ground-truth function, while the student is \emph{overspecified} and capable of exact recovery. Our analysis reveals that the teacher’s limited capacity leads to unavoidable errors in the subspace spanned by low-variance directions of the data covariance matrix whenever the ground-truth function carries significant signal in these directions. In contrast, the student, when trained by gradient flow with fixed random features, converges more slowly in these low-variance directions and can therefore reduce the teacher’s error through early stopping. Consequently, the student attains better generalization than the teacher and exhibits weak-to-strong generalization. Finally, we characterize the spectral conditions on the data covariance matrix under which weak-to-strong generalization provably occurs.
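Below is a minimal simulation sketch of the setup described in the abstract, under illustrative assumptions (dimensions, learning rate, ground-truth construction, and oracle early stopping are all hypothetical choices, not the paper's protocol): an underspecified random-feature teacher is fit on clean labels, an overspecified random-feature student is trained by gradient descent (a discretization of gradient flow) on the teacher's labels only, and the best iterate along the student's path is compared with the teacher. Whether the early-stopped student actually beats the teacher depends on the spectral conditions the paper characterizes.

```python
# Sketch of the weak-to-strong setup: underspecified random-feature teacher,
# overspecified random-feature student trained on teacher labels via gradient
# descent (discretized gradient flow), with (oracle) early stopping.
# All sizes, the learning rate, and the target construction are illustrative
# assumptions, not the paper's experimental protocol.
import numpy as np

rng = np.random.default_rng(0)

d, n = 50, 2000                      # input dimension, training samples
m_teacher, m_student = 20, 400       # teacher underspecified, student overspecified

# Anisotropic data covariance: a few high-variance directions, many low-variance ones.
eigvals = np.concatenate([np.full(5, 10.0), np.full(d - 5, 0.1)])
X = rng.standard_normal((n, d)) * np.sqrt(eigvals)

# Ground truth with significant signal in the low-variance directions.
beta_star = np.concatenate([np.full(5, 0.1), np.full(d - 5, 1.0)])
y = X @ beta_star

def random_features(X, W):
    # Fixed random first layer with ReLU; only the linear output layer is trained.
    return np.maximum(X @ W, 0.0)

# Teacher: small random-feature model fit by least squares on the clean labels.
W_t = rng.standard_normal((d, m_teacher)) / np.sqrt(d)
Phi_t = random_features(X, W_t)
a_t, *_ = np.linalg.lstsq(Phi_t, y, rcond=None)
y_teacher = Phi_t @ a_t              # the only labels the student ever sees

# Student: large random-feature model, trainable output layer only.
W_s = rng.standard_normal((d, m_student)) / np.sqrt(d)
Phi_s = random_features(X, W_s)

# Held-out data to measure error against the ground truth.
X_test = rng.standard_normal((5 * n, d)) * np.sqrt(eigvals)
y_test = X_test @ beta_star
teacher_test_err = np.mean((random_features(X_test, W_t) @ a_t - y_test) ** 2)
Phi_s_test = random_features(X_test, W_s)

# Gradient descent on the student's least-squares loss w.r.t. the teacher labels.
G = Phi_s.T @ Phi_s / n
b = Phi_s.T @ y_teacher / n

a_s = np.zeros(m_student)
lr = 1e-3
best_err = np.inf
for step in range(20000):
    a_s -= lr * (G @ a_s - b)
    if step % 100 == 0:
        err = np.mean((Phi_s_test @ a_s - y_test) ** 2)
        best_err = min(best_err, err)   # oracle early stopping: best iterate on the path

print(f"teacher test error:               {teacher_test_err:.4f}")
print(f"early-stopped student test error: {best_err:.4f}")
```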
Supplementary Material: zip
Primary Area: optimization
Submission Number: 12582