Keywords: Audio Role-Playing, Large Language Models, Multimodal Dataset, Data Construction
Abstract: While existing role-playing research predominantly focuses on text, Audio Role-Playing (ARP) presents the unique challenge of synchronizing semantic content with vocal characteristics. To address this gap, we propose AudioRole, a meticulously curated dataset built from 13 TV series spanning 1K+ hours with 1M+ character-grounded dialogues, providing synchronized audio-text pairs annotated with speaker identities and contextual metadata. To demonstrate the effectiveness of the dataset, we further introduce ARP-Eval, a dual-aspect evaluation framework that assesses both response quality and role fidelity. Empirical validation shows that GLM-4-Voice trained on AudioRole (the ARP-Model) achieves an average Acoustic Personalization score of 0.31, significantly outperforming both the original GLM-4-Voice and the more powerful MiniCPM-O-2.6. The ARP-Model also achieves a Content Personalization score of 0.36, surpassing the untrained original model by about 38%. Blind human perceptual evaluation confirms these findings.
AudioRole features dialogues from over 115 main characters and is released alongside 6 trained ARP-Models and evaluation protocols. Together, these provide an essential resource for advancing audio-grounded role-playing research.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Speech Recognition, Text-to-Speech and Spoken Language Understanding
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 5947