Dissecting Hierarchical Reasoning Models: A Mechanistic Study

Published: 11 Jun 2026, Last Modified: 11 Jun 2026; Venue: Mech Interp Workshop ICML 2026 Virtual poster; Readers: Everyone; License: CC BY 4.0
Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Applications of interpretability, Methods (probing, steering, causal interventions)
Other Keywords: Latent Reasoning, Hierarchical Reasoning Model
TL;DR: We use mechanistic interpretability to show that HRM solves Sudoku by iteratively refining a constraint-aware solution state in its high-level module. That state is so distributed, however, that probes decode it cleanly while ablating the probe directions has no causal effect.
Abstract: Machine reasoning is currently performed largely by generating reasoning in natural language. The Hierarchical Reasoning Model (HRM) offers an alternative: reasoning in latent space. Despite HRM's strong reasoning performance, the internal mechanisms that enable it to reason remain underexplored. In this work, we aim to understand mechanistically how HRM reasons and what information it encodes, using comparison against baseline models, causal interventions, linear probing with directed ablation, and sparse autoencoders for feature discovery. Our analyses yield key findings on the iterative refinement HRM performs in its latent space, on how HRM encodes information in its hierarchical structure, and on why standard interpretability tools fail to explain HRM.
Submission Number: 596
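
The abstract's "linear probing with directed ablation" can be illustrated with a minimal sketch on synthetic activations. This is not the authors' code and does not use HRM: the data-generating setup, the least-squares probe, and all function names below are assumptions chosen for illustration. The sketch fits a linear probe to a binary label, projects the probe direction out of the activations, and checks that the original probe can no longer read the label.

```python
import numpy as np

def fit_linear_probe(X, y):
    """Fit a least-squares linear probe; returns (weights, bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef[:-1], coef[-1]

def ablate_direction(X, w):
    """Remove the component of every row of X along direction w."""
    u = w / np.linalg.norm(w)
    return X - np.outer(X @ u, u)

def probe_accuracy(X, y, w, b):
    """Accuracy of thresholding the probe readout at 0.5."""
    return float(np.mean(((X @ w + b) > 0.5) == (y > 0.5)))

# Synthetic stand-in for hidden states (assumption: NOT real HRM activations).
rng = np.random.default_rng(0)
n, d = 2000, 32
y = rng.integers(0, 2, size=n).astype(float)
v = rng.normal(size=d)
v = 4.0 * v / np.linalg.norm(v)            # label direction with norm 4
X = rng.normal(size=(n, d)) + np.outer(y - 0.5, v)

w, b = fit_linear_probe(X, y)
acc_before = probe_accuracy(X, y, w, b)

# Directed ablation: project out the probe direction, then re-read with
# the same probe. The readout collapses to the constant bias b.
X_abl = ablate_direction(X, w)
acc_after = probe_accuracy(X_abl, y, w, b)

print(f"probe accuracy before ablation: {acc_before:.3f}")
print(f"probe accuracy after ablation:  {acc_after:.3f}")
```

In a real study, the quantity of interest after ablation would be the model's downstream behavior (for example, its Sudoku accuracy) rather than the probe's own readout. A gap between clean decoding and an unchanged downstream metric is the kind of decodable-but-not-causal pattern the TL;DR describes.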