Beyond Behavioural Evaluations for Assessing World Models

Published: 10 Jun 2025 · Last Modified: 14 Jul 2025 · ICML 2025 World Models Workshop · CC BY 4.0
Keywords: world models, behavioural evaluations, mechanistic interpretability, capability elicitation
Abstract: To predict the future capabilities of agentic systems, it is useful to understand the extent to which foundation models have internal world models. Agents with robust internal world models generalise better to unseen and out-of-distribution data. Interpretability evaluations suggest that transformers trained on tasks like Othello have robust world models, yet behavioural evaluations cast doubt on whether these agents' world models are as robust as the interpretability research indicates. We argue that, to claim an ML model lacks a robust world model, practitioners must either use interpretability evaluations or provide an argument that their behavioural evaluations fully elicit the model's capabilities. We therefore propose a protocol combining evaluations and elicitation to assess the world models of frontier AI systems.
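
To make the contrast in the abstract concrete, below is a minimal sketch of the two evaluation styles: a linear probe on a model's internal activations (an interpretability evaluation) versus measuring the legality of sampled moves (a behavioural evaluation). This is not the paper's actual protocol; the data is synthetic, standing in for an Othello-playing transformer's residual-stream activations, and every name (`hidden_states`, `board_labels`, the mock policy) is an illustrative assumption.

```python
# Sketch: interpretability vs behavioural evaluation of a world model.
# Synthetic data stands in for a real transformer's activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for residual-stream activations: (n_positions, d_model).
n, d = 2000, 128
hidden_states = rng.normal(size=(n, d))

# Stand-in for the true state of one board square at each position
# (0 = empty, 1 = mine, 2 = yours), correlated with the activations
# so the probe has something to find.
w_true = rng.normal(size=(d, 3))
board_labels = (hidden_states @ w_true).argmax(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, board_labels, random_state=0
)

# Interpretability evaluation: can a linear probe read the board
# state out of the model's internals?
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy on one square: {probe.score(X_te, y_te):.3f}")

# Behavioural evaluation (schematic): the fraction of sampled moves
# that are legal under the game rules. A mock policy stands in for
# the model; a real study would roll the model out on held-out or
# out-of-distribution games.
sampled_moves_legal = rng.random(500) < 0.95  # hypothetical policy
print(f"behavioural legal-move rate: {sampled_moves_legal.mean():.3f}")
```

The point the abstract makes is visible in this framing: a low behavioural score alone cannot establish the absence of a world model, since it may reflect under-elicitation rather than a missing internal representation, whereas a successful probe is direct evidence that the representation exists.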
Submission Number: 41