Representation-alignment in Theory-of-Mind tasks across Language Models and Agents

Rohini Elora Das; Krissh Bhargava; Krishna Shinde; Rajarshi Das

Published: 06 Mar 2025, Last Modified: 04 May 2025. Venue: ICLR 2025 Re-Align Workshop (Poster). License: CC BY 4.0

Track: long paper (up to 10 pages)

Domain: cognitive science

Abstract: As AI systems increasingly tackle complex reasoning tasks, understanding how they internally represent mental states is crucial. Prior research has explored representational alignment in vision models, but its role in higher-order cognition, particularly in Theory of Mind (ToM) tasks, remains under-examined. This study evaluates how AI models encode and compare mental states in ToM tasks such as False Belief, Irony, and Faux Pas reasoning. Using a triplet-based similarity framework, we assess whether structured reasoning models (e.g., DeepSeek R1) exhibit better alignment than token-based models like LLaMA. While AI models correctly answer individual ToM queries, they fail to recognize broader conceptual structures, clustering stories by surface-level textual similarity rather than by belief-based organization. This misalignment persists across 0th-, 1st-, and 2nd-order ToM reasoning, highlighting a fundamental gap between human and AI cognition. Moreover, the explicit reasoning mechanisms in DeepSeek do not reliably improve alignment, as models struggle to capture hierarchical ToM structures. To further probe this gap, we propose extending representational analysis to temporally evolving, multi-agent belief systems, capturing how beliefs about beliefs shift across time and interaction. Our findings suggest that achieving deeper AI alignment requires moving beyond task accuracy toward developing structured, human-like mental representations. Using triplet-based alignment metrics, we propose a novel approach to quantify AI cognition and guide future improvements in reasoning, interpretability, and social alignment. Additionally, we propose this representational framework as a potential foundation for a noninvasive, scalable cognitive monitoring tool for early-stage dementia or Alzheimer's disease, analogous to fMRI-based biomarkers but deployable through everyday interactions on mobile platforms.
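
The abstract does not spell out how the triplet-based alignment metric is computed, so the following is only a minimal sketch of one common odd-one-out formulation, not the authors' implementation. It assumes each story has a vector representation and that a reference (for example, human) odd-one-out choice exists for each triplet; the paper may instead elicit triplet judgments by prompting. The names `odd_one_out` and `triplet_alignment`, the story labels, and the toy 2-D vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def odd_one_out(embeddings, triplet):
    """Return the item left out of the most similar pair in the triplet."""
    i, j, k = triplet
    # Key each pairwise similarity by the item that the pair excludes.
    pair_similarity = {
        k: cosine(embeddings[i], embeddings[j]),
        j: cosine(embeddings[i], embeddings[k]),
        i: cosine(embeddings[j], embeddings[k]),
    }
    return max(pair_similarity, key=pair_similarity.get)

def triplet_alignment(embeddings, reference_choices):
    """Fraction of triplets where the model's odd-one-out matches the reference."""
    hits = sum(
        odd_one_out(embeddings, triplet) == choice
        for triplet, choice in reference_choices.items()
    )
    return hits / len(reference_choices)

# Hypothetical reference judgment: the irony story is the odd one out
# among two false-belief stories and one irony story.
reference = {("fb_1", "fb_2", "irony_1"): "irony_1"}

# Toy vectors (assumed) where the irony story shares surface wording with fb_1.
surface_like = {"fb_1": [1.0, 0.0], "fb_2": [0.0, 1.0], "irony_1": [0.9, 0.1]}

# Toy vectors (assumed) organized by belief structure instead.
belief_like = {"fb_1": [1.0, 0.0], "fb_2": [0.9, 0.1], "irony_1": [0.0, 1.0]}

surface_score = triplet_alignment(surface_like, reference)
belief_score = triplet_alignment(belief_like, reference)
```

In this toy setup the surface-driven vectors pick `fb_2` as the odd one out and score zero against the reference, while the belief-organized vectors agree with it, which mirrors the kind of mismatch the abstract reports.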

Submission Number: 63
