AI in a vat: Fundamental limits of efficient world modelling for safe agent sandboxing

Published: 09 May 2025 · Last Modified: 28 May 2025 · RLC 2025 · CC BY 4.0
Keywords: World modelling, POMDP, agent sandboxing, AI safety, AI interpretability
TL;DR: This paper maps fundamental trade-offs in building world models used to test AI agents before deployment.
Abstract: World models provide controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. Unfortunately, the scope and depth of safety assessments can be severely restricted by the computational demands imposed by high-fidelity simulations. Inspired by the classic 'brain in a vat' thought experiment, here we investigate ways to simplify world models that remain agnostic to the AI agent under evaluation. Our analysis reveals fundamental trade-offs in the construction of world models related to their computational efficiency and interpretability. We identify procedures to build world models that either minimise memory requirements, delineate the limits of what a capable agent could learn about the world, or enable retrospective analyses to reveal the causes of undesirable outcomes. In doing so, we take a first step toward charting the fundamental limits of agent sandboxing, while establishing a common language bridging reinforcement learning, control theory, and computational mechanics.
Submission Number: 367