CharacterCraft: Bridging the Literature-Reality Dialogue Gap for Practical Role-Playing Agents

ACL ARR 2025 February Submission7319 Authors

16 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Recent advancements in large language models (LLMs) have given rise to the emergence of role-playing agents (RPAs). The development of high-quality dialogue datasets is critical for advancing RPAs. However, existing datasets have two main issues: (1) the bias between query distributions and real-world user language usage, and (2) the challenge of ensuring responses accurately reflect character traits. To address these issues, we propose CharacterCraft, a novel framework designed for practical RPAs, comprising a tailored Chinese role-playing dataset and a robust evaluation method. First, we develop a specialized model for Chinese dialogue extraction, achieving state-of-the-art performance. Using this model, we then extract a large amount of character dialogue from novels, ensuring high data quality (issue 2). To mitigate the literature-reality dialogue bias in extracted dialogue (issue 1), we introduce an iterative augmentation-reconstruction method, which revises queries to better align with common language usage. Additionally, we propose a context-aware memory retrieval module for fine-grained alignment with the character and introduce a reference-guided LLM-as-a-judge evaluation method for more reliable assessments by comparing their responses to source material dialogues. Our automated pipeline produces a large-scale high-quality Chinese role-playing dataset with 21,392 samples and 121,418 utterances. The experimental results demonstrate the effectiveness of our framework and reveal the limitations of existing RPAs when faced with diverse scenes.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, NLP datasets, automatic creation and evaluation of language resources
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 7319
Loading