Keywords: Chain of Thought/Reasoning models
TL;DR: We reverse-engineer how an R1-trained model does self-verification.
Abstract: How do reasoning models verify their own answers?
We study this question by training a model using DeepSeek R1's recipe on the CountDown task.
We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences.
With this setup, we do top-down and bottom-up analyses to reverse-engineer how the model verifies its outputs.
Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as ``success'' or ``incorrect''.
Bottom-up, we find that ``previous-token heads'' are mainly responsible for self-verification in our setup.
Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as six attention heads whose ablation disables self-verification, pointing to a necessary component of a potentially larger verification circuit.
Finally, we verify that similar verification components exist in our base model and in a general-reasoning DeepSeek-R1 model.
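To make the two probes described in the abstract concrete, the snippet below is a minimal sketch of (i) a logit-lens-style reading of GLU down-projection columns to surface verification-related tokens and (ii) zero-ablation of a small set of attention heads via forward pre-hooks. This is not the authors' code: the model name, layer/neuron indices, head list, and Countdown-style prompt are placeholders, and it assumes a Qwen/Llama-style Hugging Face checkpoint exposing `model.model.layers[i].mlp.down_proj` and `self_attn.o_proj`.

```python
# Minimal sketch (not the paper's code): model name, layer/neuron indices,
# head list, and prompt are illustrative placeholders.
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-3B"  # placeholder; substitute the R1-trained checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# --- Top-down probe: read GLU value vectors through the unembedding ---------
# Each column of down_proj is the vector written to the residual stream when
# that GLU neuron fires; projecting it onto the vocabulary (logit-lens style)
# shows which tokens it promotes.
def top_tokens_for_neuron(layer_idx: int, neuron_idx: int, k: int = 10):
    value_vec = model.model.layers[layer_idx].mlp.down_proj.weight[:, neuron_idx]
    vocab_logits = model.lm_head.weight @ value_vec  # [vocab_size]
    top_ids = torch.topk(vocab_logits, k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

print(top_tokens_for_neuron(layer_idx=20, neuron_idx=123))  # placeholder indices

# --- Bottom-up probe: zero-ablate a handful of attention heads ---------------
# A forward pre-hook on o_proj zeroes the per-head slices of its input, so the
# selected heads no longer write into the residual stream.
HEADS_TO_ABLATE = {(18, 3), (18, 7), (21, 0)}  # (layer, head) placeholders
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_heads_hook(head_indices):
    def hook(module, args):
        hidden = args[0].clone()
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return hook

by_layer = defaultdict(list)
for layer_idx, head_idx in HEADS_TO_ABLATE:
    by_layer[layer_idx].append(head_idx)

handles = [
    model.model.layers[layer_idx].self_attn.o_proj.register_forward_pre_hook(
        ablate_heads_hook(heads)
    )
    for layer_idx, heads in by_layer.items()
]

# Generate on a Countdown-style prompt and inspect whether the chain of thought
# still contains a verification step.
prompt = "Using the numbers 3, 7, 25, reach the target 46."  # illustrative
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))

for handle in handles:
    handle.remove()
```

Hooking `o_proj`'s input is one convenient way to approximate head ablation because the per-head outputs are concatenated there before being mixed back into the residual stream; the paper's actual intervention may differ in detail.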
Submission Number: 52