Keywords: mechanistic interpretability, reasoning
TL;DR: We reverse-engineer how a reasoning model performs self-verification.
Abstract: How do reasoning models verify their own answers?
We study this question by training a model using DeepSeek R1's recipe on the CountDown task.
We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences.
With this setup, we do top-down and bottom-up analyses of how the model verifies its outputs.
Top-down, we find Gated Linear Unit (GLU) weights that encode verification-related tokens such as ``success'' or ``incorrect''.
Bottom-up, we find that ``previous-token heads'' are responsible for self-verification in our setup.
Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads whose ablation disables self-verification, pointing to a necessary component of a potentially larger verification circuit.
Finally, we verify that similar verification components exist in our base model and in a general-purpose DeepSeek-R1 reasoning model.
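For intuition, here is a minimal, hypothetical sketch of the top-down weight analysis described in the abstract: each GLU (MLP) down-projection column is treated as a value vector written into the residual stream and projected through the unembedding matrix, and we check whether verification-related tokens such as ``success'' or ``incorrect'' appear among its top logits. The checkpoint name, probe tokens, and Llama/Qwen-style module names (`mlp.down_proj`, `lm_head`) are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: project GLU "value vectors" onto the vocabulary and
# look for verification-related tokens among their top logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B"  # placeholder base model, not necessarily the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

W_U = model.lm_head.weight  # [vocab_size, d_model] unembedding matrix
probe_tokens = [" success", " incorrect"]  # illustrative probe tokens
probe_ids = [tok(t, add_special_tokens=False).input_ids[0] for t in probe_tokens]

with torch.no_grad():
    for layer_idx, layer in enumerate(model.model.layers):
        # Each down_proj column is the vector one GLU neuron writes to the residual stream.
        value_vecs = layer.mlp.down_proj.weight           # [d_model, d_ff]
        vocab_logits = W_U @ value_vecs                   # [vocab_size, d_ff]
        # (memory-hungry for large vocabularies; chunk over columns if needed)
        top_ids = vocab_logits.topk(k=10, dim=0).indices  # top-10 tokens per GLU neuron
        for probe_id, probe_str in zip(probe_ids, probe_tokens):
            hits = (top_ids == probe_id).any(dim=0).nonzero().flatten()
            for neuron in hits.tolist():
                print(f"layer {layer_idx:2d}, neuron {neuron:5d}: "
                      f"'{probe_str.strip()}' among top-10 projected tokens")
```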
Paper Published: No
Paper Category: Long Paper
Demography: No, I do not identify with any of these affinity groups
Academic: Others
Academic Other: postdoc
Submission Number: 1