EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

Published: 03 Jun 2026, Last Modified: 08 Jun 2026
Venue: AI4GOOD Workshop 2026 Regular
License: CC BY 4.0
Keywords: Cooperative AI, Theory of Mind, Multi-agent Benchmark
TL;DR: We propose a benchmark that measures Theory of Mind (ToM) capabilities in AI agents. The benchmark is evolving: new tasks are generated as LLMs improve so that it does not saturate, and improvement on it reflects genuine improvement in ToM capabilities.
Abstract: Theory of Mind (ToM), the ability to track others' epistemic state, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score $0.0\%\,\mathrm{Pass}^3$ on functional task completion, while averaging $45.0\%$ on literal belief probes. Manual analysis traces $93\%$ of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.
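The abstract reports scores as Pass^3 without defining the metric. The sketch below assumes the common reading in which a task counts as passed only if all 3 independent trials succeed; this definition, the function name pass_hat_k, and the example outcomes are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeeded.

    Assumed reading of Pass^k; the paper may define it differently.
    """
    if not trials:
        raise ValueError("no tasks given")
    solved = 0
    for task_id, outcomes in trials.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A single failed trial among the first k fails the whole task.
        if all(outcomes[:k]):
            solved += 1
    return solved / len(trials)


# Hypothetical per-task trial outcomes (not data from the paper).
example = {
    "task_a": [True, True, False],
    "task_b": [True, True, True],
    "task_c": [False, False, False],
    "task_d": [True, False, True],
}
score = pass_hat_k(example, k=3)
print(f"Pass^3 = {score:.1%}")
```

Under this reading the metric rewards consistency rather than occasional success, so a model that solves a task in only some trials still scores zero on that task.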
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 296