The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration

Authors: ACL ARR 2025 May Submission71 Authors

Published: 07 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: Weak-to-strong generalization, in which a strong model supervised only by a weaker teacher ends up outperforming that teacher, offers a promising approach to aligning superhuman models with human values. To deepen understanding of this approach, we provide theoretical insights into its capabilities and limitations. First, in the classification setting, we establish upper and lower bounds on the strong model's generalization error, identifying its primary limitations as the weak model's generalization error and the optimization objective itself. We also derive upper and lower bounds on the strong model's calibration error. These bounds yield two key insights: (1) the weak model should generalize well and produce well-calibrated predictions, and (2) the strong model's training must strike a careful balance, since excessive optimization can cause overfitting to the weak supervision. Finally, in the regression setting, we extend the analysis of Charikar et al. (2024) to a KL-divergence-based loss, guaranteeing that the strong student outperforms its weak teacher by at least the magnitude of their disagreement. We validate the theory with synthetic experiments.
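To make the shape of the regression-setting guarantee concrete, here is a minimal LaTeX sketch of the claimed inequality. The notation is assumed for illustration, not quoted from the paper: f* denotes the ground-truth model, f_w the weak teacher, f_{w→s} the strong student trained on the weak teacher's labels, and D_KL the KL-divergence-based loss. The form mirrors the Pythagorean-style squared-loss bound of Charikar et al. (2024), which typically requires conditions such as convexity of the strong model class.

% Illustrative sketch only (assumed notation; not the paper's exact theorem).
% Reads: teacher error minus student error is at least the disagreement.
\[
  \underbrace{D_{\mathrm{KL}}\big(f^{*}\,\|\,f_{w}\big)}_{\text{teacher error}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\big(f^{*}\,\|\,f_{w\to s}\big)}_{\text{student error}}
  \;\ge\;
  \underbrace{D_{\mathrm{KL}}\big(f_{w\to s}\,\|\,f_{w}\big)}_{\text{disagreement}}
\]

In words: the student's improvement over the teacher is lower-bounded by the student-teacher disagreement, so the student provably outperforms the teacher whenever the two disagree.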
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: weak-to-strong generalization, generalization, calibration
Contribution Types: Theory
Languages Studied: English
Submission Number: 71