Keywords: Theory-of-Mind, Large Language Model, Reasoning
Abstract: We present an interactive demo for evaluating whether a large language model
(LLM) “understands” a simple yet strategic environment. We use Rock–Paper–Scissors
(RPS) as an example to demonstrate how models can understand and
interact with real-world games under various strategies. Our system lets the LLM
act as an Observer, whose aim is to produce a predictive distribution over RPS
outcomes for a given matchup and to reveal how the model “thinks” during games.
Currently, we provide a benchmark comprising a family of static strategies and
lightweight dynamic strategies, with clearly prompted rules for the models. We quantify
alignment between the Observer’s distribution and the ground-truth distribution
induced by the actual strategies using three metrics: Cross-Entropy, Brier score, and
Expected Value (EV) discrepancy, and we evaluate models fairly by their average loss (Union Loss).
The demo emphasizes interactivity, transparency, and reproducibility: users can
adjust LLM distributions in real time, visualize losses instantly, and inspect
failure modes to understand how the LLM “thinks” during games and how its
strategy evolves over time. We
release implementation details and evaluation scripts for easy reproduction.
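To make the Union Loss concrete, the following is a minimal sketch, not the released evaluation scripts, of how the three metrics could be averaged for a single matchup. The win/tie/loss outcome ordering and the payoff vector used for the EV term are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: Union Loss as the average of Cross-Entropy, Brier score,
# and EV discrepancy between the Observer's predicted distribution and the
# ground-truth outcome distribution. Outcome order and payoffs are assumptions.

OUTCOMES = ("win", "tie", "loss")      # from the Observer's perspective (assumed)
PAYOFFS = np.array([1.0, 0.0, -1.0])   # assumed payoff per outcome

def union_loss(pred, truth, eps=1e-12):
    """Average of cross-entropy, Brier score, and EV discrepancy."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    cross_entropy = -np.sum(truth * np.log(pred + eps))
    brier = np.sum((pred - truth) ** 2)
    ev_gap = abs(np.dot(PAYOFFS, pred) - np.dot(PAYOFFS, truth))
    return (cross_entropy + brier + ev_gap) / 3.0

# Example: Observer predicts a near-uniform matchup; ground truth favors wins.
print(union_loss([0.4, 0.3, 0.3], [0.6, 0.2, 0.2]))
```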
Submission Type: Demo Paper (4-9 Pages)
Submission Number: 42