TL;DR: This paper studies Role-Playing Language Agents (RPLAs) for established characters, and presents CoSER, a collection of authentic character datasets for RPLAs, along with open state-of-the-art models and evaluation protocols using such data.
Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters remains a challenging task for RPLAs, due to the lack of authentic character datasets and of nuanced evaluation methods that use such data. In this paper, we present CoSER, a collection comprising a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance, surpassing or matching GPT-4o on our evaluation and three existing benchmarks, e.g., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively. Our code, dataset and models are available at: https://github.com/Neph0s/CoSER.
Lay Summary: Imagine chatting with an AI version of Harry Potter or Sherlock Holmes and having it respond exactly as the character would in the original stories. While AI chatbots have become advanced, creating authentic character roleplay remains challenging because high-quality datasets and effective evaluation methods are lacking. Current systems often fail to capture the unique personalities, speaking patterns, and depth of beloved fictional characters.
We present CoSER, a comprehensive collection comprising a high-quality dataset, open models, and novel evaluation methods for authentic AI character roleplay. Our dataset is unprecedented in scale and authenticity: we extracted 29,798 real conversations involving 17,966 characters from 771 renowned books. Unlike previous work that used artificially generated dialogues, CoSER provides authentic interactions taken directly from acclaimed literature, including not just what characters say, but also their experiences, inner thoughts, actions, and the circumstances surrounding each conversation.
We developed "Given-Circumstance Acting" (GCA), a novel approach for both evaluating and training AI character roleplay. For evaluation, GCA creates multi-character conversations in which AI models portray different characters in authentic book scenes, then assesses their performance using expert-designed criteria. For training, we use the same approach to develop two state-of-the-art AI models, CoSER 8B and CoSER 70B, which learn to portray multiple characters within scenes while understanding context and relationships. Our models achieve strong performance, often surpassing advanced systems like GPT-4o, with potential to transform interactive storytelling, AI companions, and entertainment.
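As an illustration only, the multi-character simulation idea behind GCA can be sketched as a loop in which one actor function per character takes turns speaking within a shared scene. This is not the authors' implementation: the names `Scene`, `echo_actor`, and `simulate` are hypothetical, the round-robin speaker order is an assumption made for simplicity, and the stub actor stands in for a real role-playing LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Scene:
    """Hypothetical container for a book scene (the 'given circumstances')."""
    setting: str
    characters: List[str]
    transcript: List[Tuple[str, str]] = field(default_factory=list)


# An actor maps (character name, current scene) to that character's next line.
Actor = Callable[[str, Scene], str]


def echo_actor(character: str, scene: Scene) -> str:
    """Stub actor standing in for an LLM call; it only reports whose turn it is."""
    turn = len(scene.transcript) + 1
    return f"[{character} speaks, turn {turn}]"


def simulate(scene: Scene, actors: Dict[str, Actor], max_turns: int) -> List[Tuple[str, str]]:
    """Let the characters speak in round-robin order (an assumed policy)."""
    for turn in range(max_turns):
        speaker = scene.characters[turn % len(scene.characters)]
        line = actors[speaker](speaker, scene)
        scene.transcript.append((speaker, line))
    return scene.transcript


if __name__ == "__main__":
    scene = Scene(setting="A study on Baker Street", characters=["Holmes", "Watson"])
    actors = {name: echo_actor for name in scene.characters}
    for speaker, line in simulate(scene, actors, max_turns=4):
        print(f"{speaker}: {line}")
```

In a real setup, each actor would prompt a model with the character profile, the scene description, and the transcript so far, and the resulting transcript would then be scored against the evaluation criteria.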
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Neph0s/CoSER
Primary Area: Applications->Language, Speech and Dialog
Keywords: Role-Playing Language Models, LLM Persona, Dataset, Evaluation
Submission Number: 3600