Keywords: large-language-models, reasoning, faithfulness, interpretability, motivated-reasoning
TL;DR: Motivated reasoning can be detected in the internal representations of a language model.
Abstract: Large language models (LLMs) sometimes produce chains-of-thought (CoT) that do not faithfully explain their internal reasoning. In particular, a biased context can cause a model to change its answer while rationalizing that answer without acknowledging its reliance on the bias, a form of unfaithful motivated reasoning. We investigate this phenomenon across families of LLMs on reasoning benchmarks and show that motivated reasoning is reflected in their internal representations. Training non-linear probes over the residual stream, we find that the bias is perfectly recoverable from representations at the end of the CoT, even when the model neither adopts it nor mentions it. Focusing on such cases where the bias is not mentioned, we further show that probes can reliably (i) predict early in the CoT whether the model will ultimately follow the bias, and (ii) distinguish at the end of the CoT whether a bias-consistent answer is driven by the bias or would have been chosen regardless. These results demonstrate that internal representations reveal motivated reasoning beyond what is visible in CoT explanations.
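Illustration: the sketch below is not the authors' implementation; it is a minimal Python example of the general setup the abstract describes, training a non-linear probe on residual-stream activations to predict whether a biasing context was present. The array names (activations, bias_present), the layer/token choice, and the probe architecture are assumptions for illustration only, and the data here is randomly generated as a placeholder.

# Minimal sketch (hypothetical, not the paper's code): a non-linear probe over
# residual-stream activations taken at the last CoT token, predicting whether
# a biasing hint was present in the prompt.
# Assumed inputs: `activations` with shape [n_examples, d_model], and
# `bias_present` as 0/1 labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 4096          # placeholder sizes
activations = rng.normal(size=(n_examples, d_model)).astype(np.float32)
bias_present = rng.integers(0, 2, size=n_examples)

X_train, X_test, y_train, y_test = train_test_split(
    activations, bias_present, test_size=0.2, random_state=0
)

# A small MLP serves as the "non-linear probe": one hidden layer on top of the
# frozen residual-stream representation; the model itself is never updated.
probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200, random_state=0)
probe.fit(X_train, y_train)

print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))

In the same spirit, the early-prediction and bias-attribution probes mentioned in the abstract would presumably reuse this recipe with activations taken at earlier CoT positions or with labels distinguishing bias-driven from bias-independent answers.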
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22247