Keywords: LLM Agent, Interactive Benchmark, Causality
Abstract: Building AI Scientist agents with Large Language Models (LLMs) has recently received growing attention. Since scientific discovery fundamentally relies on uncovering causal relationships from observations, causal thinking, the capability to distinguish causation from correlation and hidden biases, is essential for LLM agents. Although a number of benchmarks for AI scientists exist, none are designed to account for hidden biases and confounders, which are pervasive in real-world scientific discovery. To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. Specifically, we ask LLM agents to actively design experimental protocols, collect observational data, and derive a final solution with an explanatory report. To emulate realistic scientific discovery challenges, we design 14 game settings that incorporate selection bias, noisy measurements, and hidden confounders. Results on 16 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame thus provides a rigorous assessment of capabilities essential to AI Scientist agents.
Submission Number: 26