Keywords: Mechanistic interpretability, interpretability, reasoning models, thinking models, chain-of-thought, on-policy, causal interventions, agentic misalignment, unfaithfulness, inference‑time scaling, test‑time compute
TL;DR: Resampling chain-of-thought to study trajectory distributions rather than single rollouts enables reliable causal analysis, clearer narratives of model reasoning, and principled methods for interventions in reasoning models.
Abstract: We argue that interpreting reasoning models from a single chain-of-thought (CoT) is fundamentally inadequate. To understand computation and causal influence, one must study reasoning as a distribution of possible trajectories elicited by a given prompt. We approximate this distribution via on-policy resampling and use it to answer concrete questions about the causes of model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In agentic misalignment scenarios where models seemingly blackmail to preserve themselves, we resample specific sentences to measure their downstream effects. We find that normative self-preservation sentences have unusually small and non-resilient causal impact on the final decision across models, indicating they are not a meaningful driver of blackmail. Second, are handwritten edits to CoT sufficient for steering reasoning? We find that off-policy sentence insertions common in earlier literature yield small and unstable effects in decision-making tasks, whereas on-policy resampling produces larger and more consistent effects. Third, how do we attribute causal influence when models modify their plans or correct prior errors during reasoning? We introduce a resilience metric and counterfactual importance that repeatedly resample to remove sentences such that similar content doesn't reappear downstream. Critical planning statements resist removal but have large effects when successfully eliminated. Fourth, what can our methods, which focus on the mechanistic roles of CoT, teach us about unfaithful reasoning? Adapting causal mediation analysis, we edit hint pathways mid-trajectory and find that prompt hints exert smooth and cumulative influences rather than single-step pivots. Hidden information can influence the trajectory of reasoning by shifting what decisions are made at different junctures in a CoT, and these biases can be modeled and quantified with resampling. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled guidance for CoT interventions.
Primary Area: interpretability and explainable AI
Submission Number: 21061
Loading