Keywords: Role-playing, evaluating evaluators
TL;DR: The paper introduces a benchmark to assess LLMs' effectiveness in role-playing evaluation by framing it as a classification task.
Abstract: Role-playing in large language models (LLMs) has become a crucial area of research, enabling models to simulate diverse personas and tailor their responses, with significant implications for natural language understanding and human-computer interaction. However, while advanced LLMs such as GPT-4 are commonly used to evaluate role-playing methods, their reliability in providing accurate assessments remains uncertain, particularly when it comes to distinguishing nuanced role-playing characteristics. In this paper, we introduce PersonaEval, a benchmark designed to assess the effectiveness of LLMs in role-playing evaluation tasks. We frame the problem as a classification task to determine whether an LLM evaluator can distinguish between sentences aimed at different levels of expertise based solely on linguistic cues. Using real-world data from the Wired 5 Levels video series, in which experts explain concepts to five distinct audiences (a child, a teenager, a college student, a graduate student, and another expert), we design three evaluation settings that correspond to commonly used LLM evaluation approaches: single answer role grading, pairwise role comparison, and reference-guided role grading. These settings capture different aspects of how effectively LLMs evaluate role-playing performance. Our study highlights the limitations of current LLMs in persona evaluation tasks and underscores the need for further research to enhance their evaluation capabilities. We provide a foundation for future work aimed at improving the accuracy and professionalism of LLM evaluators in role-playing contexts.
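For intuition, the single answer role grading setting described above can be read as a five-way classification problem: given a sentence, the LLM evaluator must name the audience level it targets, and accuracy is computed against the ground-truth level. The sketch below is a minimal illustration of that framing, not the paper's actual implementation; the query_llm() stub, the prompt wording, the helper names, and the toy demo sentence are all hypothetical assumptions.

```python
# Minimal sketch of "single answer role grading" as five-way classification.
# Hypothetical illustration only; replace query_llm() with a real call to the
# LLM evaluator under test.

LEVELS = ["child", "teenager", "college student", "graduate student", "expert"]


def query_llm(prompt: str) -> str:
    # Stand-in for the LLM evaluator being assessed; always answers "expert"
    # here so the sketch runs end to end without an API key.
    return "expert"


def grade_single_answer(sentence: str) -> str:
    # Ask the evaluator to pick exactly one of the five audience labels
    # based only on linguistic cues in the sentence.
    prompt = (
        "An expert is explaining a concept to one of five audiences: "
        + ", ".join(LEVELS) + ".\n"
        f'Sentence: "{sentence}"\n'
        "Based only on linguistic cues, which audience is being addressed? "
        "Answer with exactly one of the five labels."
    )
    return query_llm(prompt).strip().lower()


def accuracy(examples):
    # examples: iterable of (sentence, gold_label) pairs.
    pairs = list(examples)
    correct = sum(grade_single_answer(s) == gold for s, gold in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy demo pair, invented for illustration only.
    demo = [("Imagine your toy box, but for numbers.", "child")]
    print(f"accuracy: {accuracy(demo):.2f}")
```

Under this reading, pairwise role comparison and reference-guided role grading would change only the prompt (two candidate sentences, or a reference sentence for the target level) while the accuracy computation stays the same.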
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5927