Thought Anchors: Which LLM Reasoning Steps Matter?

Published: 30 Sept 2025, Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Spotlight · CC BY 4.0
Keywords: Chain of Thought/Reasoning models, Interpretability tooling and software
Other Keywords: Mechanistic interpretability, interpretability, reasoning models, thinking models, inference-time scaling, test-time compute, chain-of-thought, attention, planning, attribution
TL;DR: We introduce black-box and white-box methods for interpreting reasoning LLMs' chain-of-thought at the sentence level, identifying and categorizing sentences with outsized effects on the model's final answer and mapping sentence-to-sentence dependencies.
Abstract: Reasoning large language models have achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges, as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to interpreting reasoning. We present three complementary attribution methods: (1) a black-box method measuring each sentence's counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method aggregating attention patterns between pairs of sentences, which identifies "broadcasting" sentences that receive high attention from all future sentences via "receiver" attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on future sentences' tokens. Each method provides evidence for the existence of thought anchors: reasoning steps that disproportionately influence the rest of the reasoning trajectory. These thought anchors are usually planning or backtracking sentences. We provide an open-source tool for visualizing our methods' outputs (anonymous-interface.com) and present a case study showing converging patterns across methods, which together map the model's multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.
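
To make method (1) concrete, the following is a minimal Python sketch of the counterfactual-importance idea described in the abstract: condition rollouts on either the original sentence or a resampled sentence with a different meaning, then compare the resulting answer distributions. The helper names (generate, extract_answer, resample_sentence) and the use of total-variation distance are illustrative assumptions, not the paper's released tooling.

from collections import Counter
from typing import Callable, Dict, List

def answer_distribution(prefix: str,
                        generate: Callable[[str], str],
                        extract_answer: Callable[[str], str],
                        n_rollouts: int = 100) -> Dict[str, float]:
    # Sample rollouts continuing `prefix` and tally the final answers.
    answers = [extract_answer(generate(prefix)) for _ in range(n_rollouts)]
    counts = Counter(answers)
    return {a: c / n_rollouts for a, c in counts.items()}

def counterfactual_importance(sentences: List[str],
                              idx: int,
                              generate: Callable[[str], str],
                              extract_answer: Callable[[str], str],
                              resample_sentence: Callable[[str, str], str],
                              n_rollouts: int = 100) -> float:
    # Answer distribution when sentence `idx` is kept vs. replaced by a
    # resampled sentence assumed to have a different meaning (hypothetical helper).
    prefix = " ".join(sentences[:idx])
    kept = answer_distribution(prefix + " " + sentences[idx],
                               generate, extract_answer, n_rollouts)
    alt_sentence = resample_sentence(prefix, sentences[idx])
    swapped = answer_distribution(prefix + " " + alt_sentence,
                                  generate, extract_answer, n_rollouts)
    # Total-variation distance between the two answer distributions as an
    # importance score for sentence `idx` (one possible choice of divergence).
    keys = set(kept) | set(swapped)
    return 0.5 * sum(abs(kept.get(k, 0.0) - swapped.get(k, 0.0)) for k in keys)

In this sketch, a sentence whose removal (by semantic resampling) shifts the answer distribution strongly would score highly, which is the behavior the paper attributes to thought anchors such as planning or backtracking sentences.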
Submission Number: 78