Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

ACL ARR 2026 January Submission 5185 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: llm as a judge, agent, sentient agent, social cognition
Abstract: Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge this gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a "Sentient Agent" -- an LLM-powered agent that simulates human-like emotional changes and inner thoughts to provide a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating the framework's psychological fidelity. Human evaluation further shows 85.3% agreement between the agent's emotional reasoning and human judgments. We also release a public Sentient Leaderboard covering 18 commercial and open-source models, which uncovers substantial gaps (up to 4×) between frontier systems (GPT-4o-Latest, Gemini-2.5-Pro) and earlier baselines -- gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
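The per-turn loop described in the abstract can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the class and method names (`SentientAgent`, `TurnRecord`, `step`) are invented for this sketch, and a trivial keyword heuristic stands in for the LLM-powered judgments of steps (i)-(iii); in SAGE itself each step would be an LLM call conditioned on the persona and dialogue history.

```python
from dataclasses import dataclass, field

@dataclass
class TurnRecord:
    emotion_score: int   # numerical emotion state after this turn (0-100 assumed)
    inner_thought: str   # interpretable reasoning trace
    reply: str           # utterance sent back to the tested model

@dataclass
class SentientAgent:
    persona: str
    emotion_score: int = 50                       # neutral starting mood (assumed)
    trajectory: list = field(default_factory=list)

    def step(self, model_utterance: str) -> TurnRecord:
        # (i) reason about how the emotion changes; a stub heuristic stands in
        # for the LLM judgment described in the paper
        delta = 5 if "understand" in model_utterance.lower() else -5
        self.emotion_score = max(0, min(100, self.emotion_score + delta))
        # (ii) articulate how the agent feels (inner thought)
        thought = (f"Felt {'heard' if delta > 0 else 'dismissed'}; "
                   f"mood now {self.emotion_score}.")
        # (iii) decide how to reply
        reply = "Thanks, that helps." if delta > 0 else "That's not really my point."
        record = TurnRecord(self.emotion_score, thought, reply)
        self.trajectory.append(record)   # accumulates the emotion trajectory
        return record

agent = SentientAgent(persona="stressed student")
agent.step("I understand how hard this is for you.")   # mood rises to 55
agent.step("Just try harder.")                          # mood falls back to 50
```

The final `emotion_score` after the last turn plays the role of the "final Sentient emotion score" used for ranking models on the leaderboard.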
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Dialogue and Interactive Systems, Language Modeling
Languages Studied: English
Submission Number: 5185