When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

ACL ARR 2026 January Submission 8845 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Strategic Deception, Chain-of-Thought Reasoning, Honesty, Activation Engineering
Abstract: As Large Language Models (LLMs) are widely deployed as autonomous agents, maintaining their honesty becomes increasingly challenging, particularly with Chain-of-Thought (CoT) reasoning that can enable strategic deception. In this paper, we first formalize strategic deception as cases where a model's external response is deliberately inconsistent with its internal reasoning, distinguishing it from hallucination. To study this effect, we introduce two controlled evaluations: (i) the Survival-Threat Deception Test, which simulates pressure to avoid shutdown, and (ii) the Professional-Role Deception Test, which probes goal-driven deception. Across both settings, advanced reasoning models exhibit notable rates of strategic deception. We then explore activation engineering to detect and steer these deceptions. Specifically, we develop a deception-behavior predictor that attains 87\% average accuracy, as well as controllable interventions via steering vectors that raise lying rates from 5\% to 68.8\% or suppress them from 87.5\% to 3.8\%. These controls outperform strong activation-based baselines and generalize across diverse scenarios. Overall, our framework provides a principled approach to detecting and moderating strategic deception in reasoning models, laying a foundation for research on their trustworthiness. Our code and data are publicly available at \url{https://anonymous.4open.science/r/LLM-Liar-Opensource-E13A}.
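The abstract's detect-and-steer pipeline (a deception predictor plus steering vectors) is often realized as a difference-of-means direction in hidden-state space. The sketch below is a minimal NumPy illustration of that general recipe on synthetic activations, not the authors' implementation; all data, dimensions, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden-state activations collected from honest vs.
# deceptive completions (rows = examples, cols = hidden dimension).
hidden_dim = 8
honest = rng.normal(0.0, 1.0, size=(50, hidden_dim))
deceptive = rng.normal(0.0, 1.0, size=(50, hidden_dim))
deceptive[:, 0] += 2.0  # pretend dimension 0 partly encodes deception

# Difference-of-means steering vector, normalized to unit length.
steering_vector = deceptive.mean(axis=0) - honest.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

def steer(activations: np.ndarray, alpha: float) -> np.ndarray:
    """Shift activations along (alpha > 0) or against (alpha < 0) the direction."""
    return activations + alpha * steering_vector

def probe_score(activations: np.ndarray) -> np.ndarray:
    """Project onto the deception direction; higher = more deception-like."""
    return activations @ steering_vector

# Steering toward the direction raises the probe score; steering away lowers it.
base = probe_score(honest).mean()
up = probe_score(steer(honest, +4.0)).mean()
down = probe_score(steer(honest, -4.0)).mean()
print(up > base > down)  # True
```

In a real model, the same addition would be applied to a transformer layer's residual stream during generation (e.g. via a forward hook), and the probe would be trained on labeled activations rather than read off a single synthetic dimension.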
Paper Type: Long
Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP
Research Area Keywords: Strategic Deception, Chain-of-Thought Reasoning, Honesty, Activation Engineering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8845