Representation-alignment in Theory-of-Mind tasks across Language Models and Agents

Published: 06 Mar 2025, Last Modified: 06 Mar 2025 · ICLR 2025 Re-Align Workshop Poster · CC BY 4.0
Track: long paper (up to 10 pages)
Domain: cognitive science
Abstract: As AI systems increasingly tackle complex reasoning tasks, understanding how they internally structure mental states is crucial. While prior research has explored representational alignment in vision models, its role in higher-order cognition, particularly Theory of Mind (ToM), remains underexamined. This study evaluates how AI models encode and compare mental states in ToM tasks, focusing on False Belief, Irony, and Faux Pas reasoning. Using a triplet-based similarity framework, we assess whether structured reasoning models (e.g., DeepSeek R1) exhibit better alignment than token-based models (e.g., LLaMA). While AI models correctly answer individual ToM queries, they fail to recognize broader conceptual structures, clustering stories by surface-level textual similarity rather than by belief-based organization. This misalignment persists across 0th-, 1st-, and 2nd-order ToM reasoning, highlighting a fundamental gap between human and AI cognition. Moreover, explicit reasoning mechanisms in DeepSeek do not reliably improve alignment, as models struggle to capture hierarchical ToM structures. Our findings suggest that achieving deeper AI alignment requires moving beyond task accuracy toward developing structured, human-like mental representations. By leveraging triplet-based alignment metrics, we propose a novel approach for quantifying AI cognition and guiding future improvements in reasoning, interpretability, and social alignment.
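To illustrate the kind of triplet-based alignment metric the abstract describes, the sketch below scores a model by odd-one-out agreement: for each triplet of stories, the model's embedding similarities pick the least-similar item, and alignment is the fraction of triplets where that choice matches a reference (e.g., belief-based human) judgment. This is a minimal illustration, not the paper's exact procedure; the `embeddings`, `triplets`, and `reference_choices` inputs and all function names are assumptions for the example.

```python
# Minimal sketch of a triplet-based (odd-one-out) alignment metric.
# All names and data structures here are illustrative assumptions,
# not the paper's actual implementation.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(e0, e1, e2):
    """Return the index (0, 1, or 2) of the item least similar to the other two."""
    sims = [cosine(e1, e2), cosine(e0, e2), cosine(e0, e1)]
    # The item excluded from the most similar pair is the odd one out.
    return int(np.argmax(sims))

def triplet_alignment(embeddings, triplets, reference_choices):
    """Fraction of triplets where the model's odd-one-out choice matches
    the reference judgment (e.g., a belief-based human grouping)."""
    hits = 0
    for (i, j, k), ref in zip(triplets, reference_choices):
        if odd_one_out(embeddings[i], embeddings[j], embeddings[k]) == ref:
            hits += 1
    return hits / len(triplets)
```

With story embeddings from any model (structured-reasoning or token-based), the same triplets and reference choices can be scored to compare how closely each model's representational geometry tracks belief-based rather than surface-level organization.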
Submission Number: 63
