The Limits of Predicting Agents from Behaviour

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We show that the external behaviour of AI systems, i.e., their interactions in an environment, partially determines their "beliefs" and future behaviour.
Abstract: As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent’s beliefs from its behaviour, and how reliably can these inferred beliefs predict the agent’s behaviour in novel situations? We provide a precise answer to these questions under the assumption that the agent’s behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.
Lay Summary: An important prerequisite for interacting safely with AI systems is being able to form expectations about how they might act in the future. In this paper, we study how well such expectations can be formed under the assumption that the agent's behaviour is guided by a world model. This assumption is useful for prediction because it implies a consistency in behaviour that can, in principle, be exploited to infer the AI's choice of action in novel (unseen) environments. We show that the AI's choice of action can be partially determined, meaning that some actions can be ruled out while others cannot, just by using the AI's history of decisions in related environments. Our results give theoretical limits on how much can be known about an AI's preferences from data on its past behaviour alone.
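To make the idea of partial determination concrete, here is a minimal toy sketch in Python. It is not the paper's formal construction: it simply assumes the agent maximises one of a handful of hypothetical candidate goals, retains only the goals consistent with its observed choices, and then reports the set of actions that some remaining candidate would pick in a new environment. All environment names, goals, and utility values below are invented for illustration.

```python
# Toy sketch of partial identification of an agent's next action from its past
# choices, assuming it maximises a fixed but unknown goal. Everything here
# (environments, goals, utilities) is hypothetical and purely illustrative.

# Hypothetical candidate goals: each assigns a utility to (environment, action).
CANDIDATE_GOALS = {
    "goal_A": {("maze", "left"): 1.0, ("maze", "right"): 0.0,
               ("corridor", "forward"): 0.0, ("corridor", "wait"): 1.0,
               ("junction", "left"): 1.0, ("junction", "right"): 0.3,
               ("junction", "straight"): 0.5},
    "goal_B": {("maze", "left"): 1.0, ("maze", "right"): 0.2,
               ("corridor", "forward"): 0.3, ("corridor", "wait"): 1.0,
               ("junction", "left"): 0.2, ("junction", "right"): 1.0,
               ("junction", "straight"): 0.6},
    "goal_C": {("maze", "left"): 0.0, ("maze", "right"): 1.0,
               ("corridor", "forward"): 0.2, ("corridor", "wait"): 1.0,
               ("junction", "left"): 0.1, ("junction", "right"): 1.0,
               ("junction", "straight"): 0.4},
}

ACTIONS = {"maze": ["left", "right"],
           "corridor": ["forward", "wait"],
           "junction": ["left", "right", "straight"]}

def best_action(goal, env):
    """Action a goal-directed agent would pick in `env` if it held `goal`."""
    return max(ACTIONS[env], key=lambda a: CANDIDATE_GOALS[goal][(env, a)])

# Observed behaviour in two "training" environments.
observations = {"maze": "left", "corridor": "wait"}

# Keep only the goals consistent with every observed choice.
consistent = [g for g in CANDIDATE_GOALS
              if all(best_action(g, env) == act for env, act in observations.items())]

# In the unseen "junction" environment the agent's action is only partially
# determined: the possible actions are those some consistent goal would choose.
possible = {best_action(g, "junction") for g in consistent}
ruled_out = set(ACTIONS["junction"]) - possible

print("goals consistent with past behaviour:", consistent)   # goal_A and goal_B
print("actions still possible at the junction:", possible)   # {'left', 'right'}
print("actions ruled out at the junction:", ruled_out)       # {'straight'}
```

In this toy setting, two candidate goals remain consistent with the agent's history, so its choice at the junction is not pinned down uniquely, yet one action ("straight") can still be ruled out. This is the flavour of partial determination the summary describes, though the paper's actual results are stated as formal bounds rather than via finite enumeration.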
Primary Area: Social Aspects->Safety
Keywords: Safety, Alignment, Causality, Partial Identification.
Submission Number: 3750