Keywords: Cooperative AI, Theory of Mind, Multi-agent Benchmark
TL;DR: We propose a benchmark that measures Theory of Mind (ToM) capabilities in AI agents. The benchmark is evolving: it does not saturate as LLMs improve, and gains on it reflect genuine improvement in ToM capabilities.
Abstract: Theory of Mind (ToM), the ability to track others' epistemic state, makes humans efficient collaborators.
AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief
questions.
The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested.
We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private
information, and constrained communication.
Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models
improve.
On the hard split, all seven evaluated frontier models score $0.0\%\,\mathrm{Pass}^3$ on functional task completion, while averaging
$45.0\%$ on literal belief probes.
Manual analysis traces $93\%$ of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner
constraints, and misallocated messages, providing a concrete target for future work.
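The abstract reports $\mathrm{Pass}^3$ without defining it. One common reading of this kind of metric, assumed here and not confirmed by the abstract, is that a task counts as solved only if all three independent trials succeed. Under that assumption, a minimal sketch of the computation might look like the following; the function name `pass_hat_k` and the example task IDs and outcomes are hypothetical, not taken from EnactToM.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeed.

    Assumption: Pass^k is read as "every one of k trials succeeds".
    The abstract does not state the definition, so this is illustrative only.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A single failed trial among the first k means the task is not solved.
        if all(trials[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical per-task outcomes over three trials (made-up data).
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [True, True, True],
}
score = pass_hat_k(example, k=3)
print(f"Pass^3 = {score:.1%}")
```

Under this reading, a model that succeeds on most individual trials can still score low, because one failed trial on a task removes that task from the numerator.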
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 296