Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges

Anuradha Welivita; Fawzia Zeitoun; Pearl Pu

Published: 18 May 2026, Last Modified: 18 May 2026, Venue: CoNLL 2026 Archival, License: CC BY 4.0

Keywords: Empathy, Empathetic response generation, Large Language Models (LLMs), Human vs AI comparison, LLM-as-a-judge, Human evaluation, Affective computing

TL;DR: In a large-scale study with 1,000 human raters and an LLM judge, LLM responses were consistently rated as more empathetic than human-written ones, though both humans and the LLM judge showed self-favoring bias.

Abstract: This paper compares the empathetic quality of responses generated by humans and large language models (LLMs). We evaluate four LLMs that were widely used at the time of study—GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8×7B-Instruct—against a human baseline using a large-scale between-subjects study. A total of 1,000 human participants evaluated the empathetic quality of human- and LLM-generated responses to 2,000 dialogue prompts spanning 32 positive and negative emotions. To complement human judgments, we also employed an LLM-as-judge (GPT-4o-mini) to assess the same responses. Across emotions and evaluators, LLM-generated responses were rated as significantly more empathetic than human-written responses. We also observed that both human judges and the LLM-as-judge tended to rate responses generated by their own group more favorably, indicating self-favoring tendencies. These findings highlight both the strong performance of contemporary LLMs in empathetic responding and the need to interpret human- and LLM-based evaluations with care.

Supplementary Material: zip

Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.

Primary Area Selection: Interaction and Dialogue

Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript

Data Collection From Human Subjects: Yes, with details included in the main paper or in an appendix on (1) how the data was obtained, (2) how participants were recruited and paid, (3) how consent was obtained, and (4) whether an IRB protocol was approved for this study. Note that providing this information is obligatory.

Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.

Submission Number: 129
