Keywords: mechanistic interpretability, reasoning
TL;DR: We reverse-engineer how a reasoning model performs self-verification.
Abstract: How do reasoning models verify their own answers?
We study this question by training a model using DeepSeek R1's recipe on the CountDown task.
We leverage the fact that preference tuning leads to mode collapse, yielding a model that always produces highly structured chain-of-thought sequences.
With this setup, we do top-down and bottom-up analyses of how the model verifies its outputs.
Top-down, we find Gated Linear Unit (GLU) weights that encode verification-related tokens such as ``success'' or ``incorrect''.
Bottom-up, we find that ``previous-token heads'' are responsible for self-verification in our setup.
Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU weights to localize as few as three attention heads whose ablation disables self-verification, pointing to a necessary component of a potentially larger verification circuit.
Finally, we verify that similar verification components exist in our base model and in a general-purpose DeepSeek-R1 reasoning model.
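For intuition, here is a minimal, hypothetical sketch of the top-down weight analysis described in the abstract: each GLU (MLP) down-projection column is treated as a value vector written into the residual stream and projected through the unembedding matrix, and we check whether verification-related tokens such as ``success'' or ``incorrect'' appear among its top logits. The checkpoint name, probe tokens, and Llama/Qwen-style module names (`mlp.down_proj`, `lm_head`) are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: project GLU "value vectors" onto the vocabulary and
# look for verification-related tokens among their top logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B"  # placeholder base model, not necessarily the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

W_U = model.lm_head.weight  # [vocab_size, d_model] unembedding matrix
probe_tokens = [" success", " incorrect"]  # illustrative probe tokens
probe_ids = [tok(t, add_special_tokens=False).input_ids[0] for t in probe_tokens]

with torch.no_grad():
    for layer_idx, layer in enumerate(model.model.layers):
        # Each down_proj column is the vector one GLU neuron writes to the residual stream.
        value_vecs = layer.mlp.down_proj.weight           # [d_model, d_ff]
        vocab_logits = W_U @ value_vecs                   # [vocab_size, d_ff]
        # (memory-hungry for large vocabularies; chunk over columns if needed)
        top_ids = vocab_logits.topk(k=10, dim=0).indices  # top-10 tokens per GLU neuron
        for probe_id, probe_str in zip(probe_ids, probe_tokens):
            hits = (top_ids == probe_id).any(dim=0).nonzero().flatten()
            for neuron in hits.tolist():
                print(f"layer {layer_idx:2d}, neuron {neuron:5d}: "
                      f"'{probe_str.strip()}' among top-10 projected tokens")
```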
Paper Published: No
Paper Category: Long Paper
Demography: No, I do not identify with any of these affinity groups
Academic: Others
Academic Other: postdoc
Submission Number: 1