Keywords: llm as a judge, agent, sentient agent, social cognition
Abstract: Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge.
To bridge this gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition.
SAGE instantiates a ``Sentient Agent'' -- an LLM-powered agent that simulates human-like emotional changes and inner thoughts to provide a more realistic evaluation of the tested model in multi-turn conversations.
At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts.
Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. Human evaluation further demonstrates 85.3\% consistency between the agent's emotional reasoning and human judgments. We also build a public Sentient Leaderboard covering 18 commercial and open-source models, which uncovers substantial gaps (up to 4$\times$) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines -- gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Dialogue and Interactive Systems, Language Modeling
Languages Studied: English
Submission Number: 5185