Keywords: AI Safety, Applications of interpretability, Probing
TL;DR: We train generative models on board games and probe their internal representations to predict weak-to-strong generalization success.
Abstract: AI alignment seeks to align models with human values, yet humans may be unable to evaluate whether a model is aligned on tasks exceeding human capabilities. Weak-to-strong generalization (WSG) has been proposed as a proxy for studying this problem, where a weaker model stands in for human evaluation of a stronger model. While prior work provides evidence of WSG success, it suffers from train-test contamination or relies on oversimplified linear models. We introduce a clean testbed in which pairs of transformer models are pretrained on different variants of Othello and Tic-Tac-Toe, and the stronger model is then finetuned on data from the weaker model. Using mechanistic interpretability techniques, we demonstrate that the stronger model outperforms the weaker model if and only if it has better board representations. Across 111 WSG pairs and 6 game rules, we find a 0.844 Spearman correlation between WSG success and superior board representations in the strong model, as measured by linear probes. By open-sourcing our code, models, and probes, we hope to accelerate research on interpreting WSG.
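To illustrate the kind of analysis the abstract describes, the following is a minimal sketch (not the authors' released code): it fits one linear probe per board square on a model's internal activations and then correlates the strong-vs-weak probe-accuracy gap with WSG success across model pairs. All data-access names here (`pairs`, `strong_acts`, `weak_acts`, `labels`, the accuracy fields) are hypothetical placeholders.

```python
# Hedged sketch: linear probing of board representations and the Spearman
# correlation between probe-accuracy advantage and WSG success.
# Placeholder data structures; not the paper's actual pipeline.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression

def probe_accuracy(activations, board_labels):
    """Fit a linear probe per board square; return mean held-out accuracy."""
    n = len(activations)
    split = int(0.8 * n)
    accs = []
    for sq in range(board_labels.shape[1]):  # e.g. 64 squares for Othello
        clf = LogisticRegression(max_iter=1000)
        clf.fit(activations[:split], board_labels[:split, sq])
        accs.append(clf.score(activations[split:], board_labels[split:, sq]))
    return float(np.mean(accs))

# For each weak/strong pair: WSG success (e.g. the finetuned strong model's
# accuracy gain over the weak model) vs. the probe-accuracy gap.
wsg_success, probe_gap = [], []
for pair in pairs:  # hypothetical list of per-pair records
    wsg_success.append(pair["strong_finetuned_acc"] - pair["weak_acc"])
    probe_gap.append(
        probe_accuracy(pair["strong_acts"], pair["labels"])
        - probe_accuracy(pair["weak_acts"], pair["labels"])
    )

rho, p = spearmanr(wsg_success, probe_gap)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```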
Submission Number: 197