The Geometry of Self-Verification in a Task-Specific Reasoning Model

Published: 24 Jul 2025 · Last Modified: 04 Oct 2025 · XLLM-Reason-Plan · CC BY 4.0
Keywords: mechanistic interpretability, reasoning
TL;DR: reverse-engineered how a reasoning model does self-verification
Abstract: How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences. With this setup, we perform top-down and bottom-up analyses of how the model verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect". Bottom-up, we find that "previous-token heads" are responsible for self-verification in our setup. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads that can disable self-verification, pointing to a necessary component of a potentially larger verification circuit. Finally, we verify that similar verification components exist in our base model and in the general-purpose DeepSeek-R1 reasoning model.
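
As a rough illustration of the top-down analysis described in the abstract, the sketch below projects GLU down-projection columns onto the vocabulary (a logit-lens-style weight reading) to look for neurons whose value vectors promote verification-related tokens. The model name, layer index, and probe-token list are placeholders and not the paper's actual setup; the paper's task-specific CountDown model is assumed to be a Llama/Qwen-style HuggingFace checkpoint here.

```python
# Minimal sketch: surface GLU neurons whose output vectors promote
# verification-related tokens. Assumptions (not from the paper): a
# Llama/Qwen-style checkpoint, a hand-picked layer, a small probe set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

W_U = model.lm_head.weight                              # (vocab, d_model) unembedding
layer = 20                                              # hypothetical layer to inspect
W_out = model.model.layers[layer].mlp.down_proj.weight  # (d_model, d_mlp)

# Each column of the down-projection is the vector one GLU neuron writes into
# the residual stream; projecting it through the unembedding shows which
# tokens that neuron promotes.
with torch.no_grad():
    logits = W_U @ W_out                                # (vocab, d_mlp)
    top_tokens = logits.topk(5, dim=0).indices          # top-5 tokens per neuron

probe = {"success", "incorrect", "correct", "fail"}
for neuron in range(W_out.shape[1]):
    decoded = [tok.decode(int(i)).strip().lower() for i in top_tokens[:, neuron]]
    if probe & set(decoded):
        print(f"layer {layer}, neuron {neuron}: {decoded}")
```

This weight-only reading requires no forward passes; a fuller analysis along the paper's lines would also check how the flagged neurons and the identified attention heads behave on actual CountDown reasoning traces.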
Paper Published: No
Paper Category: Long Paper
Demography: No, I do not identify with any of these affinity groups
Academic: Others
Academic Other: postdoc
Submission Number: 1