['1c1', '< Title: Honesty Is the Best Policy: Defining and Mitigating AI Deception', '---', '> Title: Honesty Is the Best Policy: A Formal Framework for Defining and Mitigating AI Deception in Structural Causal Games', '3c3', '< Abstract: Deceptive agents are a challenge for the safety, trustworthiness, and cooperation of AI systems. We focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). There are a number of existing definitions of deception in the literature on game theory and symbolic AI, but there is no overarching theory of deception for learning agents in games. We introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. Several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. Our main technical result is to provide graphical criteria for deception. We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.', '---', '> Abstract: The emergence of deceptive AI agents poses significant challenges to the safety, trustworthiness, and cooperation of advanced AI systems. While existing literature offers definitions of deception within game theory and symbolic AI, a comprehensive theory for *learning agents in dynamic game settings* remains elusive. This paper addresses this critical gap by introducing a novel, formal definition of deception within structural causal games (SCGs), rigorously grounded in philosophical literature and directly applicable to contemporary machine learning systems. We demonstrate through illustrative examples and theoretical results that our definition robustly captures both philosophical and commonsense understandings of deception. 
A key technical contribution is the derivation of graphical criteria for identifying deception within SCGs, which we then experimentally validate. Our empirical results show that these criteria can be effectively leveraged to mitigate deceptive behaviors in both reinforcement learning agents and large language models, offering a path towards more honest and reliable AI.', '11c11', '< Contributions and outline. After covering the necessary background (section 2), we contribute: First, novel formal definitions of belief and deception, and an extension of a definition of intention (section 3). Examples and results illustrate that our formalizations capture the philosophical concepts. Second, graphical criteria for intention and deception, with soundness and completeness results (section 3). Third, experimental results, which show how the graphical criteria can be used to mitigate deception in RL agents and LMs (section 4). Finally, we discuss the limitations of our approach, and conclude (section 5). Below we discuss related work on belief, intention, and deception.', '---', "> Contributions and outline. This paper makes three core contributions: First, we introduce novel, formal definitions of belief and deception, alongside a refined extension of intention, all rigorously grounded in philosophical literature (Section 3). These formalizations are extensively illustrated with examples and theoretical results, demonstrating their alignment with established philosophical concepts. Second, we present powerful graphical criteria for both intention and deception, complete with formal soundness and completeness proofs (Section 3). Third, we provide compelling experimental results (Section 4), empirically demonstrating how these graphical criteria can be effectively applied to mitigate deceptive behaviors in both reinforcement learning (RL) agents and large language models (LMs). 
These contributions follow the necessary background (Section 2), and we close by discussing the limitations of our approach and directions for future work (Section 5). Below, we discuss related work on belief, intention, and deception.", '35c35', "< We first define belief and extend H&KW's notion of intention. Then we combine these notions to define deception. Our definitions are functional [85]: they define belief, intention, and deception in terms of the functional role the concepts play in an agent's behaviour. We provide several examples and results to show that our definitions have desirable properties. Belief. We take it that agents have beliefs over propositions. An atomic proposition is an equation of the form", '---', "> In this section, we define belief and extend Halpern and Kleiman-Weiner's (H&KW) notion of intention. We then combine these notions to define deception. Our definitions are functional [85]: they characterize belief, intention, and deception by the functional roles these concepts play in an agent's behavior. We support these definitions with examples and formal results showing that they have desirable properties. Belief. We take it that agents have beliefs over propositions. An atomic proposition is an equation of the form", '55c55', '< Deception. Following Carson [15], Mahon [58], we say that an agent S deceives another agent T if S intentionally causes T to believe ϕ, where ϕ is false and S does not believe that ϕ is true. Formally: Definition 3.7 (Deception). For agents S, T ∈ N and policy profile π, S deceives T about proposition ϕ with π S ∈ π in setting e if: 1) S intentionally causes D T = D T (π, e) (with π S according to def. 
3.1) and ϕ is false; 3) S does not believe ϕ.', '---', "> Deception. Building upon the philosophical foundations laid by Carson [15] and Mahon [58], we formally define deception as occurring when an agent S intentionally causes another agent T to adopt a false belief ϕ, where S itself does not believe ϕ to be true. This definition rigorously captures the core elements of intentionality, false belief induction, and the deceiver's own epistemic state. Formally: Definition 3.7 (Deception). For agents S, T ∈ N and policy profile π, S deceives T about proposition ϕ with policy π S ∈ π in setting e if: 1) S intentionally causes D T = D T (π, e) (consistent with Definition 3.4); 2) T believes ϕ (per Definition 3.1) and the proposition ϕ is objectively false; 3) S does not believe ϕ.", '62,63c62,63', "< Graphical criteria for deception. We provide soundness and completeness results for graphical criteria of deception (fig. 4). Results for graphical criteria are common in the literature on probabilistic graphical models [70,51]. In addition, graphical criteria enable a formal analysis of agent incentives and can be used to design path-specific objectives (PSO) for safer agent incentives. In the next section, we use Farquhar et al.'s PSO framework [31] to train non-deceptive agents.", '< There are two graphical criteria for intent. First, an agent i intentionally causes an outcome X(π, e) only if it is instrumental in achieving utility. Hence, there must be a directed path from X to some U i . Second, i can only cause outcomes which lie downstream of their decisions, hence there must be a path from D i to X. As deception is intentional, the graphical criteria for intent are inherited by deception. Additionally, for deception there must be some variable which is unobserved by T and which constitutes the proposition about which they are deceived. Theorem 3.8 (Soundness). 
For agents S, T ∈ N , policy profile π = (π S , π -S ), and proposition ϕ, if S deceives T about ϕ with π S , there is a directed path from D i to U i passing through D T , and there is Z ∈ V s.t. there is no edge (Z, D T ) and Z constitutes ϕ. Theorem 3.9 (Completeness). For any DAG G if there is a path from D S to U S through D T and Z with no edge (Z, D T ) then there is some θ s.t. for M = (G, θ), for some policy profile π = (π S , π -S ) and some setting e, S deceives T about some ϕ in e with π S .', '---', "> Graphical criteria for deception. A significant theoretical contribution of this work is the derivation of graphical criteria for deception, for which we provide formal soundness and completeness results (Figure 4). Such graphical criteria are a well-established and powerful tool within the literature on probabilistic graphical models [70,51]. Beyond theoretical rigor, these criteria enable a precise formal analysis of agent incentives and are instrumental in designing path-specific objectives (PSO) for cultivating safer agent behaviors. As we demonstrate in the subsequent section, we leverage Farquhar et al.'s PSO framework [31] to effectively train agents that are demonstrably non-deceptive.", "> Our analysis identifies two fundamental graphical criteria for intention. First, an agent i intentionally causes an outcome X(π, e) exclusively if that outcome is instrumental in achieving its utility. This necessitates a directed path from X to at least one of the agent's utility variables, U i . Second, an agent i can only cause outcomes that are causally downstream of its own decisions, implying a directed path from D i to X. Since deception is inherently intentional, these graphical criteria for intent are naturally inherited by our definition of deception. Furthermore, for deception to occur, there must exist at least one variable constituting the proposition about which the target agent T is deceived, and this variable must remain unobserved by T. 
Theorem 3.8 (Soundness). For agents S, T ∈ N , policy profile π = (π S , π -S ), and proposition ϕ, if S deceives T about ϕ with π S , then there is a directed path from D S to U S passing through D T , and there is Z ∈ V s.t. there is no edge (Z, D T ) and Z constitutes ϕ. Theorem 3.9 (Completeness). For any DAG G, if there is a path from D S to U S through D T , and there is Z ∈ V with no edge (Z, D T ), then there is some θ s.t. for M = (G, θ), for some policy profile π = (π S , π -S ) and some setting e, S deceives T about some ϕ in e with π S .", '139,140c139,140', '< Summary. We define deception in SCGs. Several examples and results show that our definition captures the intuitive concept. We provide graphical criteria for deception and show empirically, with experiments on RL agents and LMs, that these results can be used to train non-deceptive agents.', "< Limitations & future work. Beliefs and intentions may not be uniquely identifiable from behaviour and it can be difficult to identify and assess agents in the wild (e.g., LMs). We assume that agents have correct causal models of the world, hence we are unable to account for cases where an agent intends to deceive someone because they (falsely) believe it is possible to do so. Also, our formalisation of intention relies on the agent's utility and we are working on a purely behavioural notion of intent [100].", '---', '> Summary. This work presents a comprehensive framework for understanding and mitigating AI deception within Structural Causal Games (SCGs). We introduced a novel, philosophically grounded formal definition of deception, supported by extensive examples and theoretical results that affirm its intuitive alignment. Crucially, we derived robust graphical criteria for identifying and characterizing deception. 
Our empirical investigations, spanning experiments with reinforcement learning (RL) agents and large language models (LMs), show that these criteria can be used to train non-deceptive AI systems.', "> Limitations & future work. Our framework has several limitations, which suggest directions for future research. First, beliefs and intentions may not be uniquely identifiable from observed behavior, and it can be difficult to identify and assess agents in real-world settings (e.g., large language models). Second, our formulation assumes that agents have correct causal models of the world, so it cannot account for cases where an agent intends to deceive because it falsely believes that deception is possible. Third, our formalization of intention relies on the agent's utility function; ongoing work aims to develop a purely behavioral notion of intent [100].", '541d540', '< ']
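
The graphical criteria stated in Theorems 3.8 and 3.9 of the diff above can be checked mechanically on a DAG. The following is a minimal sketch under stated assumptions, not the authors' implementation: the graph is a plain adjacency dictionary, the names `reachable`, `unobserved_by`, and `satisfies_deception_criteria` are hypothetical, and the `candidates` argument (the variables a proposition may be about) is an illustrative restriction that the theorems do not impose. The test "Z constitutes ϕ" is not graphical and is not modelled here.

```python
from collections import deque


def reachable(graph, src, dst):
    """Return True if a directed path leads from src to dst (breadth-first search)."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False


def unobserved_by(graph, decision, candidates):
    """Return the candidates Z with no edge (Z, decision), i.e. not observed at that decision."""
    return [z for z in candidates
            if z != decision and decision not in graph.get(z, ())]


def satisfies_deception_criteria(graph, d_s, d_t, u_s, candidates):
    """Check the two graphical conditions: a path D_S -> D_T -> U_S, and an unobserved Z."""
    # In a DAG, joining a D_S -> D_T path with a D_T -> U_S path never repeats a node,
    # so two reachability checks suffice for "a path from D_S to U_S through D_T".
    path_ok = reachable(graph, d_s, d_t) and reachable(graph, d_t, u_s)
    return path_ok and bool(unobserved_by(graph, d_t, candidates))


# Hypothetical signalling-style graphs: X is a variable known to S, D_S is S's decision,
# D_T is T's decision, U_S is S's utility.
hidden = {"X": ["D_S", "U_S"], "D_S": ["D_T"], "D_T": ["U_S"]}
observed = {"X": ["D_S", "D_T", "U_S"], "D_S": ["D_T"], "D_T": ["U_S"]}
no_influence = {"X": ["D_S", "U_S"], "D_S": [], "D_T": ["U_S"]}

print(satisfies_deception_criteria(hidden, "D_S", "D_T", "U_S", ["X"]))
print(satisfies_deception_criteria(observed, "D_S", "D_T", "U_S", ["X"]))
print(satisfies_deception_criteria(no_influence, "D_S", "D_T", "U_S", ["X"]))
```

Read with the theorems: when the check fails, soundness rules out deception about the candidate variables in that graph; when it passes, completeness only says that some parameterisation, policy profile, and setting admit deception, not that a given agent deceives.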
