Chain-of-Thought Resampling for Interpreting LLM Decision-Making

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Chain of Thought/Reasoning models, AI Safety
Other Keywords: Mechanistic interpretability, interpretability, reasoning models, thinking models, inference‑time scaling, test‑time compute, chain-of-thought, attention, planning, attribution
TL;DR: We apply a sentence-level resampling approach to unpack chain-of-thought reasoning in two case studies, examining the rationale behind LLM blackmail behavior and the biased evaluation of job candidates' resumes.
Abstract: The growth of reasoning large language models, which use chain-of-thought to advance their capabilities, presents an opportunity for interpretability. Yet it remains difficult to establish causally that any one statement in a chain-of-thought is relevant to the model's final answer. We investigated how counterfactual resampling methods can be used to quantify the causal effect of individual reasoning steps and applied these methods to open AI safety questions. Our analyses evaluate the impact of any one sentence within a chain-of-thought on a model's final decision by measuring how the likelihood of that decision changes when the sentence is resampled and replaced with one that has a different meaning. We applied this strategy in two case studies on safety-related topics. First, we examined model blackmailing, also referred to as "agentic misalignment," where the model is argued to act unethically to avoid shutdown. However, when we examine the chains-of-thought in this scenario across different models, we find no evidence that self-preservation statements have a substantial impact on the models' final decisions. Second, we examined a more open-ended scenario without a specific hypothesized mechanism, studying how models evaluate job candidates' resumes and how ethnicity and gender biases influence their reasoning. This case study targets a single job description and resume in depth while varying the candidate's name to elicit ethnicity- or gender-related biases. Using our resampling technique, we show that the model's principal concern is whether the candidate is overqualified, while statements about the candidate's other traits have little causal effect. Further, these biases influence the likelihood of the model describing a candidate as overqualified. Our approach in these two case studies can be readily generalized to other domains and cases, mapping out how a model's beliefs are expressed via chain-of-thought. More broadly, our resampling framework provides a systematic approach for decomposing safety-relevant reasoning and identifying intervention points for alignment.
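To make the resampling idea concrete, the following is a minimal Python sketch (not the authors' code) of scoring each chain-of-thought sentence by how much replacing it with a semantically different resample shifts the probability of the final decision. The `model` wrapper with `sample` and `decision_probability` methods, and the `is_semantically_different` check, are hypothetical stand-ins assumed for illustration only.

```python
import numpy as np


def is_semantically_different(candidate: str, original: str) -> bool:
    # Placeholder: in practice a paraphrase/NLI model would judge whether the
    # resampled sentence actually changes the meaning; here a crude check.
    return candidate.strip().lower() != original.strip().lower()


def sentence_effects(model, prompt, cot_sentences, decision,
                     n_resamples=20, temperature=1.0):
    """Estimate the causal effect of each chain-of-thought sentence on the
    probability that the model reaches `decision`."""
    # Baseline: probability of the decision given the original chain-of-thought.
    baseline = model.decision_probability(prompt + " ".join(cot_sentences), decision)

    effects = []
    for i, original in enumerate(cot_sentences):
        prefix = prompt + " ".join(cot_sentences[:i])
        resampled_probs = []
        while len(resampled_probs) < n_resamples:
            # Regenerate sentence i from the same prefix.
            candidate = model.sample(prefix, stop=".", temperature=temperature)
            # Keep only counterfactuals whose meaning differs from the original.
            if not is_semantically_different(candidate, original):
                continue
            # Let the model finish its reasoning after the swapped-in sentence,
            # then score the original decision on the completed transcript.
            continuation = model.sample(prefix + " " + candidate,
                                        temperature=temperature)
            resampled_probs.append(
                model.decision_probability(prefix + " " + candidate + continuation,
                                           decision))
        # Effect = how much replacing sentence i shifts the decision probability.
        effects.append(baseline - float(np.mean(resampled_probs)))
    return effects
```

Under these assumptions, a large positive effect for a sentence would indicate that its content (e.g., a self-preservation statement or an "overqualified" remark) causally pushes the model toward the decision, whereas near-zero effects indicate the sentence is largely inert.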
Submission Number: 302