Thought Anchors: Which LLM Reasoning Steps Matter?

ICLR 2026 Conference Submission 17602 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mechanistic interpretability, interpretability, reasoning models, thinking models, inference-time scaling, test-time compute, chain-of-thought, attention, planning, attribution
TL;DR: We introduce black-box and white-box methods for interpreting reasoning LLMs' chain-of-thought at the sentence level, identifying and categorizing sentences with outsized effects on the model's final answer and mapping sentence-sentence dependencies.
Abstract: Current frontier large language models rely on reasoning to achieve state-of-the-art performance. To ensure their safety, it is crucial that we can interpret their computations. Yet many existing interpretability methods are limited in this area, as standard methods have been designed to study single forward passes of a model rather than the multi-token computational steps that unfold during reasoning. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We introduce a black-box method that measures each sentence's counterfactual importance by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence's impact on the distribution of final answers. We discover that certain sentences can have an outsized impact on the trajectory of the reasoning trace and the final answer. We term these sentences thought anchors. For example, seemingly correct reasoning traces can contain steps with significant negative causal impact that would cause the model to pursue an incorrect answer if not immediately corrected. These thought anchors are generally planning or uncertainty-management sentences, and specialized attention heads consistently attend from subsequent sentences to thought anchors. We further show that examining sentence-sentence causal links within the reasoning trace gives insight into the model's behavior. Such information can be used to predict a problem's difficulty and the extent to which different question domains involve sequential or diffuse reasoning. As a proof of concept, we demonstrate that our techniques together provide a practical toolkit for analyzing reasoning models by conducting a detailed case study of how the model solves a difficult math problem, finding that our techniques yield a consistent picture of the reasoning trace's structure. We provide an open-source tool (anonymous-interface.com/ta) for visualizing the outputs of our methods on further problems. The convergence across methods and the consistency of motifs across analyses demonstrate the potential of sentence-level analysis for a deeper understanding of reasoning models.
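To make the black-box resampling procedure described in the abstract concrete, the sketch below shows one possible implementation of the sentence-level counterfactual importance measure. All callables (`sample_replacement`, `is_semantically_different`, `continue_chain`) are hypothetical placeholders the reader would supply with their own model, and the total variation distance used to compare answer distributions is an assumed choice for illustration, not necessarily the metric used in the paper.

```python
# Minimal sketch of sentence-level counterfactual importance:
# resample a given sentence, keep only semantically different replacements,
# roll out the rest of the chain of thought, and compare final-answer distributions.

from collections import Counter
from typing import Callable, List


def answer_distribution(answers: List[str]) -> Counter:
    """Empirical distribution over final answers."""
    return Counter(answers)


def total_variation(p: Counter, q: Counter) -> float:
    """Total variation distance between two empirical answer distributions."""
    keys = set(p) | set(q)
    n_p, n_q = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)


def counterfactual_importance(
    sentences: List[str],   # the reasoning trace, split into sentences
    idx: int,               # index of the sentence being scored
    sample_replacement: Callable[[List[str]], str],         # sample a new sentence given the prefix
    is_semantically_different: Callable[[str, str], bool],  # does the replacement differ in meaning?
    continue_chain: Callable[[List[str]], str],             # roll out the rest of the CoT, return the final answer
    n_rollouts: int = 100,
) -> float:
    """Score how much replacing sentence `idx` shifts the distribution of final answers."""
    prefix = sentences[:idx]
    original = sentences[idx]

    # Baseline: rollouts that keep the original sentence in place.
    baseline_answers = [continue_chain(prefix + [original]) for _ in range(n_rollouts)]

    # Counterfactual: rollouts from resampled, semantically different sentences.
    counterfactual_answers = []
    while len(counterfactual_answers) < n_rollouts:
        candidate = sample_replacement(prefix)
        if is_semantically_different(candidate, original):
            counterfactual_answers.append(continue_chain(prefix + [candidate]))

    return total_variation(
        answer_distribution(baseline_answers),
        answer_distribution(counterfactual_answers),
    )
```

Sentences whose replacement produces a large shift in the final-answer distribution would, under this scoring, correspond to the thought anchors described above.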
Primary Area: interpretability and explainable AI
Submission Number: 17602