Abstract: Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results have shown that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of the weak and strong models' internal representations. These kernels define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of the labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insight into how certain errors in weak supervision can be corrected by the strong model, even in the presence of overfitting. Our theory has significant practical implications: it yields a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular prediction with transformers and on five NLP tasks involving 52 LLMs.
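To make the abstract's characterization concrete, the sketch below illustrates one way the described quantity could be computed: build projectors (kernels) from the top principal components of each model's representations on shared inputs, form the part of the strong model's span not covered by the weak model's, and measure the labels' projection onto it. This is a minimal illustration under assumed names (f_weak, f_strong, y, k_w, k_s) and an assumed construction; it is not the paper's actual code, and the paper's label-free metric is not reproduced here.

```python
import numpy as np

def principal_projector(feats, k):
    """Orthogonal projector onto the sample-space span of the top-k
    principal components of a (n_samples, dim) feature matrix."""
    feats = feats - feats.mean(axis=0, keepdims=True)
    # Left singular vectors of the centered feature matrix span the
    # sample-space directions captured by the principal components.
    U, _, _ = np.linalg.svd(feats, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ Uk.T  # n x n projector (a kernel over the samples)

def w2sg_shortfall(f_weak, f_strong, y, k_w=10, k_s=50):
    """Illustrative proxy for the quantity described in the abstract:
    the relative norm of the label component that lies in the strong
    model's principal span but outside the weak model's. A larger
    value suggests more is left unrealized under weak supervision."""
    P_w = principal_projector(f_weak, k_w)
    P_s = principal_projector(f_strong, k_s)
    # Project labels onto the strong span, then remove the weak span.
    residual = (np.eye(len(y)) - P_w) @ (P_s @ y)
    return np.linalg.norm(residual) / np.linalg.norm(y)
```

Note that this residual construction is only a rough stand-in for the space defined in the paper (the two projectors need not commute); it is meant to convey the shape of the computation, not its exact form.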
Lay Summary: As AI continues to improve, a key question is how weaker models—or even humans—can guide the learning of more powerful systems. Surprisingly, recent research has shown that a strong AI model can learn from a weaker one and still outperform it, a phenomenon known as Weak-to-Strong Generalization (W2SG). However, we still don’t fully understand how or when this works.
Our research offers a theoretical explanation for W2SG by analyzing how AI models internally represent information. We show that weak and strong models have distinct ways of representing data, and we derive a mathematical quantity that captures what the strong model can learn that the weak one cannot. This quantity determines how much performance gain is possible under W2SG. Our theory also explains when and how the strong model can correct the weak model’s errors—even when it is trained directly on those mistakes.
This insight has practical value: it allows us to estimate W2SG performance without requiring labeled data. We validate our approach on real-world tasks in chemistry and natural language processing, using 52 different language models. Our work could help make future AI systems more reliable and better aligned with human intent.
Primary Area: Theory->Deep Learning
Keywords: Weak-to-strong Generalization, Theory, Large Language Models, Alignment
Submission Number: 9312