Keywords: Theory-of-Mind, Large Language Model, Reasoning
Abstract: We present an interactive framework for evaluating whether large language models
(LLMs) exhibit genuine “understanding” in a simple yet strategic environment. As
a running example, we focus on Rock–Paper–Scissors (RPS), which, despite its
apparent simplicity, requires sequential reasoning, adaptation, and strategy recognition.
Our system positions the LLM as an Observer whose task is to identify which
strategies are being played and to articulate the reasoning behind this judgment.
The purpose is not to test knowledge of Rock–Paper–Scissors itself, but to probe
whether the model can exhibit mind-like reasoning about sequential behavior. To
support systematic evaluation, we provide a benchmark consisting of both static
strategies and lightweight dynamic strategies specified by carefully prompted rules.
We quantify alignment between the Observer’s predictions and the ground-truth
distributions induced by actual strategy pairs using three complementary signals:
Cross-Entropy, Brier score, and Expected Value (EV) discrepancy. These metrics
are further integrated into a unified score, the Union Loss, which balances
calibration, sensitivity, and payoff alignment. Together with a Strategy Identification
Rate (SIR) metric, our framework captures not only predictive accuracy but also
whether the model can stably identify the latent strategies in play. Our framework
emphasizes transparency and reproducibility. It is designed to allow real-time
adjustment of LLM distributions, dynamic visualization of evolving losses, and
direct inspection of reasoning traces to diagnose where and why failures occur. In
this way, the framework serves as a practical and interpretable proxy for mind
like inference in sequential games, offering insights into both the strengths and
limitations of current LLM reasoning.
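As a concrete illustration of the three signals named above, the following minimal Python sketch computes Cross-Entropy, Brier score, and EV discrepancy between a ground-truth RPS move distribution and an Observer's predicted distribution, and combines them into a single score. The equal-weight linear combination in union_loss is an illustrative assumption, not the paper's actual definition, and the function names are hypothetical; only the standard win/draw/loss RPS payoff matrix is fixed.

```python
import numpy as np

# Moves indexed as 0=Rock, 1=Paper, 2=Scissors.
# Standard RPS payoff for the row player: win=+1, draw=0, loss=-1.
PAYOFF = np.array([
    [ 0, -1,  1],   # Rock   vs (Rock, Paper, Scissors)
    [ 1,  0, -1],   # Paper
    [-1,  1,  0],   # Scissors
])

def cross_entropy(p_true, q_pred, eps=1e-12):
    """Cross-entropy H(p, q) between the ground-truth move
    distribution p and the Observer's predicted distribution q."""
    q = np.clip(q_pred, eps, 1.0)
    return -np.sum(p_true * np.log(q))

def brier_score(p_true, q_pred):
    """Multiclass Brier score: squared error between the two
    distributions over the three moves."""
    return np.sum((q_pred - p_true) ** 2)

def ev_discrepancy(p_true, q_pred, opponent):
    """Absolute gap in expected payoff when each distribution is
    played against a fixed opponent move distribution."""
    ev_true = p_true @ PAYOFF @ opponent
    ev_pred = q_pred @ PAYOFF @ opponent
    return abs(ev_true - ev_pred)

def union_loss(p_true, q_pred, opponent, w=(1.0, 1.0, 1.0)):
    """Hypothetical Union Loss: a weighted sum of the three signals.
    Equal weights are an illustrative assumption; the paper's exact
    combination rule is not specified here."""
    return (w[0] * cross_entropy(p_true, q_pred)
            + w[1] * brier_score(p_true, q_pred)
            + w[2] * ev_discrepancy(p_true, q_pred, opponent))

# Example: ground truth slightly favors Rock; the Observer predicts uniform.
p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
opp = np.array([1/3, 1/3, 1/3])
print(union_loss(p, q, opp))
```

A lower Union Loss under this sketch indicates that the Observer's predicted distribution is simultaneously well calibrated (Cross-Entropy, Brier) and payoff-consistent (EV discrepancy) with the distribution induced by the actual strategy pair.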
Submission Number: 17