PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Abstract: We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages different language models to simulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. Our methodology involves three main components: a player model that adopts a specific character role, an interrogator model that simulates user behavior in a given situation, and a judge model ensemble that evaluates conversation quality across three metrics: character consistency, entertainment value, and language fluency. We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations spanning 8 characters and 8 situations. To validate our approach, we compared automated evaluations with human annotations, demonstrating strong correlations across multiple criteria. This work provides a foundation for robust, dynamic evaluation of model capabilities in interactive scenarios.
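The abstract's three-component pipeline (player, interrogator, judge ensemble) can be summarized as a simple loop. The sketch below is a hypothetical illustration, not the benchmark's released code: the Agent and Judge wrappers, their method names, and the metric identifiers are assumptions standing in for LLM-backed implementations.

# Minimal sketch of the player/interrogator/judge loop described in the
# abstract. All names here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)
METRICS = ("character_consistency", "entertainment", "fluency")

@dataclass
class Agent:
    # reply(context, history) -> next message; backed by an LLM in practice
    reply: Callable[[str, List[Turn]], str]

@dataclass
class Judge:
    # rate(history, metric) -> numeric score; backed by an LLM in practice
    rate: Callable[[List[Turn], str], float]

def run_conversation(player: Agent, interrogator: Agent,
                     character: str, situation: str,
                     turns: int = 8) -> List[Turn]:
    # Interrogator emulates a user in the situation; player stays in character.
    history: List[Turn] = []
    for _ in range(turns):
        history.append(("user", interrogator.reply(situation, history)))
        history.append(("assistant", player.reply(character, history)))
    return history

def score(judges: List[Judge], history: List[Turn]) -> Dict[str, float]:
    # Ensemble evaluation: average each metric over all judges.
    return {m: sum(j.rate(history, m) for j in judges) / len(judges)
            for m in METRICS}

Running run_conversation once per (character, situation) pair yields the 8 x 8 = 64 conversations per model reported in the abstract, each then scored by the judge ensemble.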
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics, role-playing, benchmark
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Russian
Submission Number: 93