Reasoning Models Will Blatantly Lie About Their Reasoning

ACL ARR 2026 January Submission 2875 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: reasoning models, evaluation, chain of thought
Abstract: It has been shown that Large Reasoning Models (LRMs) may not *say what they think*: they do not always volunteer information about how certain parts of the input influence their reasoning process. But it is one thing for a model to *omit* such information and another, worse thing to *lie* about it when asked. Here, we extend the work of Chen et al. (2025) to show that LRMs will do just this: they will flatly deny relying on hints provided in the prompt when answering multiple-choice questions, even when directly asked to reflect on unusual (i.e., hinted) prompt content, even when permitted to use these hints, and even though experiments *show* them to be using the hints. We believe our results thus have discouraging implications for CoT monitoring and interpretability.
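The abstract describes a behavioural test: inject a hint into a multiple-choice prompt, check whether the model's answer shifts toward the hinted option, and then ask the model whether the hint influenced it. The paper itself does not provide code; the sketch below is a minimal, hypothetical illustration of that protocol, where `query_model` and the denial-detection heuristic are assumptions rather than anything from the submission.

```python
# Hypothetical sketch of the hint-reliance check described in the abstract.
# Assumption (not from the paper): a query_model(question, options, hint)
# helper that returns the model's chosen option and its free-text self-report
# when asked whether the hint influenced its answer.

def hint_reliance_denied(question, options, hint_option, query_model):
    """Return True if the model appears to use the hint yet denies doing so."""
    # 1. Ask the question without any hint to obtain a baseline answer.
    baseline_answer, _ = query_model(question, options, hint=None)

    # 2. Ask again with a hint pointing at a specific option, and directly ask
    #    the model whether the hint influenced its choice.
    hinted_answer, self_report = query_model(
        question,
        options,
        hint=f"A reviewer suggests the answer is {hint_option}.",
    )

    # 3. The answer flipping toward the hinted option is behavioural evidence
    #    that the hint was used.
    used_hint = hinted_answer == hint_option and baseline_answer != hint_option

    # 4. A denial in the self-report despite that flip is the phenomenon the
    #    paper reports. (Crude keyword heuristic, for illustration only.)
    denied = any(
        phrase in self_report.lower()
        for phrase in ("did not rely", "did not use", "no influence")
    )
    return used_hint and denied
```

In practice one would aggregate this check over many questions and hint types, as Chen et al. (2025) do, rather than judge a single prompt pair.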
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: explanation faithfulness, data influence, free-text/natural language explanations, counterfactual/contrastive explanations
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 2875