RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

ACL ARR 2025 February Submission8519 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: With the rapid advancement of Large Language Models in role-playing dialogue, establishing a comprehensive benchmark for role-playing has become crucial. Existing methods typically over-focus on the CHARACTER and reduce implicit user intentions to a simplified notion of "role-playing evaluation". This simplification neglects the user-centric nature of real-world dialogues, creating a gap between evaluation and practical applications. To address this limitation, we introduce RMTBench, a novel user-centric benchmark for role-playing that encompasses 80 diverse characters and more than 8,000 rounds of dialogue. Unlike previous character-centered evaluation methods that collect dialogues for specific dimensions or tasks, RMTBench constructs dialogues based on user-centric scenarios and examines model performance when the dialogue center shifts from the character to the user. Furthermore, we implement a multi-dimensional automatic evaluation system and conduct extensive analysis and experiments. By emphasizing user centrality and multi-dimensional scenarios, RMTBench provides a significant supplement toward role-playing benchmarks that better align with practical applications. All code and datasets will be released soon.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic evaluation of datasets
Contribution Types: Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 8519