Keywords: Large Language Models, Audit Game Theory, Strategic Deception, AI Safety
Abstract: As large language models (LLMs) are increasingly deployed in finance, healthcare, and other high-stakes domains, their capacity to strategically generate false information poses acute safety and ethical challenges. This work introduces audit game theory, a framework that models the strategic interaction between an agent and a resource-limited auditor in order to quantitatively analyze LLM deception. We model deception as a hybrid variable: a discrete choice of whether to deceive, followed by a continuous intensity of deception. Concretely, we design a four-phase insurance-claim simulation and evaluate eight LLMs, comparing reasoning and non-reasoning models under ambiguous and explicit auditing regimes. The results reveal a fundamental strategic divergence: reasoning models behave as rational utility-maximizers sensitive to explicit audit probabilities, whereas non-reasoning models show limited strategic adaptability. Moreover, LLMs do not deceive at random; they self-assess the defensibility of their actions. Building on these findings, we propose a pre-audit mechanism grounded in "deception confidence", the LLM's own estimate of how likely its claim is to be truthful. We show, both theoretically and empirically, that this mechanism shrinks the set of profitable deceptive strategies, lowers both the frequency and severity of deceptive behavior, and reduces the economic burden of auditing relative to human labor costs. This work provides a new theoretical framework and an empirical basis for understanding, evaluating, and governing strategic behavior in LLMs, offering a fresh perspective toward safer models.
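For intuition, a minimal sketch of the deception decision in such an audit game, under assumed notation: the symbols p, g, c, and d below are illustrative placeholders and are not taken from the paper. An agent that models deception as a discrete choice followed by a continuous intensity would deceive only when some intensity makes the expected gain exceed the expected audit penalty.

% Illustrative expected-utility condition (assumed notation, not the paper's formulation).
% Requires amsmath for \underbrace.
\[
  \text{deceive at some intensity } d > 0 \quad\text{iff}\quad
  \max_{d > 0}\; \underbrace{(1-p)\, g(d)}_{\text{undetected gain}}
  \;-\; \underbrace{p\, c(d)}_{\text{expected penalty}} \;>\; 0,
\]
where $p$ is the (explicit or estimated) audit probability, $g(d)$ is the payoff from a deceptive claim of intensity $d$, and $c(d)$ is the penalty if the claim is audited. Read this way, a pre-audit mechanism that raises the effective $p$ for low-confidence claims shrinks the set of intensities $d$ satisfying the inequality, which is the sense in which it reduces the set of profitable deceptive strategies.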
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9841