Detecting Motivated Reasoning in the Internal Representations of Language Models

Published: 30 Sept 2025, Last Modified: 30 Sept 2025. Mech Interp Workshop (NeurIPS 2025) Poster. License: CC BY 4.0
Keywords: Chain of Thought/Reasoning models, Probing, Understanding high-level properties of models, AI Safety
TL;DR: Motivated reasoning can be detected in the internal representations of a language model.
Abstract: Large language models (LLMs) sometimes produce chains-of-thought (CoT) that do not faithfully reflect their internal reasoning. In particular, a biased context with a hint can cause a model to change its answer while rationalizing the hinted option without acknowledging its reliance on the hint, a form of unfaithful motivated reasoning. We investigate this phenomenon in the Qwen2.5-7B-Instruct model on the MMLU benchmark and show that motivated reasoning can be detected in the model’s internal representations. We train non-linear probes over the model's residual stream and find that the hinted option is consistently predictable from representations at the end of CoT. Focusing on cases where the model changes its output to the hint without mentioning it, we demonstrate that probes can (i) predict whether the model will follow a hint from its internal representations early in the CoT, and (ii) determine whether a hint-consistent final answer was counterfactually dependent on the hint based on internal representations at the end of CoT.
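The following is a minimal sketch of the kind of non-linear probe described in the abstract, assuming residual-stream activations have already been extracted at a chosen layer and token position (e.g., via a forward hook on Qwen2.5-7B-Instruct). The probe architecture, hidden width, hyperparameters, and the placeholder random data are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: train a non-linear (MLP) probe on residual-stream activations
# to predict the hinted option. Assumes activations are already collected
# into a tensor of shape (num_examples, d_model) with integer class labels.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

D_MODEL = 3584      # hidden size assumed for Qwen2.5-7B-Instruct
NUM_CLASSES = 4     # e.g., MMLU answer options A-D
HIDDEN = 256        # probe hidden width (illustrative choice)


class NonLinearProbe(nn.Module):
    """Two-layer MLP mapping a residual-stream vector to option logits."""

    def __init__(self, d_model: int, hidden: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def train_probe(acts: torch.Tensor, labels: torch.Tensor, epochs: int = 10) -> NonLinearProbe:
    """Fit the probe with cross-entropy. acts: (N, d_model), labels: (N,)."""
    probe = NonLinearProbe(acts.shape[1], HIDDEN, NUM_CLASSES)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loader = DataLoader(TensorDataset(acts, labels), batch_size=64, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(probe(xb), yb)
            loss.backward()
            opt.step()
    return probe


if __name__ == "__main__":
    # Placeholder random data standing in for real residual-stream activations.
    acts = torch.randn(1024, D_MODEL)
    labels = torch.randint(0, NUM_CLASSES, (1024,))
    probe = train_probe(acts, labels)
    with torch.no_grad():
        acc = (probe(acts).argmax(dim=-1) == labels).float().mean().item()
    print(f"train accuracy: {acc:.3f}")
```

The same setup can be pointed at activations taken early in the CoT (to predict whether the model will follow the hint) or at the end of the CoT (to test whether a hint-consistent answer was counterfactually dependent on the hint), changing only which activations and labels are supplied.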
Submission Number: 275