Keywords: AI Safety, Applications of interpretability, Probing
TL;DR: We train generative models on board games and probe their internal representations to predict weak-to-strong generalization success.
Abstract: AI alignment seeks to align models with human values, yet humans may be unable to evaluate whether a model is aligned on tasks exceeding human capabilities. Weak-to-strong generalization (WSG) has been proposed as a proxy for studying this problem, where a weaker model stands in for human evaluation of a stronger model. While prior work provides evidence of WSG success, it suffers from train-test contamination or relies on oversimplified linear models. We introduce a clean testbed in which pairs of transformer models are pretrained on different variants of Othello and Tic-Tac-Toe, and the stronger model is then finetuned on data from the weaker model. Using mechanistic interpretability techniques, we demonstrate that the stronger model outperforms the weaker model if and only if it has better board representations. Across 111 WSG pairs and 6 game rules, we find a 0.844 Spearman correlation between WSG success and superior board representations in the strong model, as measured by linear probes. By open-sourcing our code, models, and probes, we hope to accelerate research on interpreting WSG.
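To illustrate the kind of analysis the abstract describes, the following is a minimal sketch (not the authors' released code): it fits one linear probe per board square on a model's internal activations and then correlates the strong-vs-weak probe-accuracy gap with WSG success across model pairs. All data-access names here (`pairs`, `strong_acts`, `weak_acts`, `labels`, the accuracy fields) are hypothetical placeholders.

```python
# Hedged sketch: linear probing of board representations and the Spearman
# correlation between probe-accuracy advantage and WSG success.
# Placeholder data structures; not the paper's actual pipeline.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression

def probe_accuracy(activations, board_labels):
    """Fit a linear probe per board square; return mean held-out accuracy."""
    n = len(activations)
    split = int(0.8 * n)
    accs = []
    for sq in range(board_labels.shape[1]):  # e.g. 64 squares for Othello
        clf = LogisticRegression(max_iter=1000)
        clf.fit(activations[:split], board_labels[:split, sq])
        accs.append(clf.score(activations[split:], board_labels[split:, sq]))
    return float(np.mean(accs))

# For each weak/strong pair: WSG success (e.g. the finetuned strong model's
# accuracy gain over the weak model) vs. the probe-accuracy gap.
wsg_success, probe_gap = [], []
for pair in pairs:  # hypothetical list of per-pair records
    wsg_success.append(pair["strong_finetuned_acc"] - pair["weak_acc"])
    probe_gap.append(
        probe_accuracy(pair["strong_acts"], pair["labels"])
        - probe_accuracy(pair["weak_acts"], pair["labels"])
    )

rho, p = spearmanr(wsg_success, probe_gap)
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```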
Submission Number: 197