ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

ACL ARR 2026 January Submission1279 Authors

29 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0

Keywords: Speech Role-Playing, Role-Playing, Large Language-Audio Model (LLAM), Text-to-Speech Synthesis (TTS)

Abstract: Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to the textual modality and neglects speech, which plays a predominant role in daily life; this limits genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through **ActorMindBench**, and we present a corresponding reasoning framework, called **ActorMind**. Specifically, (1) **Speech Role-Playing** enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and the spoken dialogue. (2) **ActorMindBench** is a hierarchical benchmark comprising _Utterance-Level_ content with 7,653 utterances, _Scene-Level_ content with 313 scenes, and _Role-Level_ content with 6 roles. Notably, we provide the corresponding data construction pipeline so that users can extend the benchmark. (3) **ActorMind** is an off-the-shelf, multi-agent, chain-of-thought (CoT) style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within the contextual spoken dialogue through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally, the Mouth Agent delivers the script infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing. The project page is available at https://github.com/*********.
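The four-stage flow described in the abstract (Eye, Ear, Brain, Mouth) can be pictured with a minimal Python sketch. This is only an illustration of the data flow under stated assumptions, not the authors' implementation: the names `Turn`, `eye_agent`, `ear_agent`, `brain_agent`, `mouth_agent`, and `actor_pipeline` are hypothetical, the `emotion_cue` field stands in for what an audio-understanding model would infer from speech, and each agent body is a placeholder where a real system would presumably call a language-audio model or a TTS model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    """One contextual dialogue turn (hypothetical structure)."""
    speaker: str
    text: str
    emotion_cue: str  # assumption: a label a speech model would infer from audio


def eye_agent(role_description: str) -> str:
    """Read the assigned role description (placeholder: normalize whitespace)."""
    return " ".join(role_description.split())


def ear_agent(context: List[Turn]) -> List[str]:
    """Collect emotional cues from the contextual spoken dialogue."""
    return [f"{turn.speaker}: {turn.emotion_cue}" for turn in context]


def brain_agent(role_profile: str, scene: str, cues: List[str]) -> str:
    """Produce a descriptive emotional state (placeholder: a template string)."""
    heard = "; ".join(cues) if cues else "no prior cues"
    return f"As {role_profile} in {scene}, having heard [{heard}]"


def mouth_agent(script: str, emotional_state: str) -> Dict[str, str]:
    """Pair the script with the emotional state; a real system would synthesize speech here."""
    return {"text": script, "style_prompt": emotional_state}


def actor_pipeline(role_description: str, scene: str,
                   context: List[Turn], script: str) -> Dict[str, str]:
    """Chain the four agents in the order given in the abstract."""
    profile = eye_agent(role_description)
    cues = ear_agent(context)
    state = brain_agent(profile, scene, cues)
    return mouth_agent(script, state)
```

Calling `actor_pipeline` with a role description, a scene, a list of `Turn` objects, and a script line returns the script paired with a style prompt, which is the point where a speech synthesizer would take over in a full system.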

Paper Type: Long

Research Area: Speech Processing and Spoken Language Understanding

Research Area Keywords: Text-to-Speech, multimodality

Contribution Types: NLP engineering experiment, Data resources

Languages Studied: English

Submission Number: 1279
