Keywords: spoken role-playing, persona consistency, diagnostic evaluation, persona drift
Abstract: Maintaining a stable persona is central to sustained spoken role-playing, yet when an agent breaks character,
current evaluations often do not isolate which component caused the failure, making fixes slow and ad hoc.
We propose \textbf{PED} (Persona--Emotion Decoupling), a diagnostic evaluation framework that treats spoken agents
as multi-stage systems and decomposes persona expression into two observable routes: what the agent says (text)
and how it sounds (speech).
PED projects transcripts and audio into a shared affective measurement space, enabling route-comparable
trajectories and baseline-referenced analyses organized by four research questions (separability, drift, failures, coupling).
We demonstrate PED via two worked instantiations spanning an end-to-end Speech LLM and a cascaded
LLM+TTS pipeline under a fixed multi-phase dialogue protocol.
In this instantiated setting, PED surfaces four recurring diagnostic signatures:
(i) route-level separability is bounded by reference overlap and can differ sharply across architectures,
(ii) text-route drift is stress-linked and tends toward a neutral mode,
(iii) the text and audio routes are only weakly coupled, yielding route-asymmetric failures,
and (iv) audio-route structure can be materially shaped by an explicit intermediate style cue in cascaded pipelines.
Overall, PED reframes holistic ``voice+character'' grading as turn-level, fault-localizing signals that support faster
debugging and iteration.
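To make the route-level idea concrete, here is a minimal sketch of how per-turn drift and route-asymmetric failures could be computed once both routes are projected into a shared space. It is an illustration under stated assumptions, not the paper's implementation: the two-dimensional (valence, arousal) space, the Euclidean drift measure, the fixed threshold, and the names `drift`, `pearson`, and `flag_asymmetric` are all hypothetical, and the input coordinates are toy values rather than reported results.

```python
import math
from statistics import mean

# Assumed 2-D affective coordinates (valence, arousal) per dialogue turn.
# PED's actual measurement space is not specified here; this is a stand-in.
Point = tuple[float, float]


def drift(trajectory: list[Point], baseline: Point) -> list[float]:
    """Euclidean distance of each turn from the persona baseline."""
    return [math.dist(p, baseline) for p in trajectory]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length series.

    Returns 0.0 when either series is constant (correlation undefined).
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def flag_asymmetric(text_drift: list[float],
                    audio_drift: list[float],
                    threshold: float) -> list[int]:
    """Indices of turns where exactly one route exceeds the drift threshold."""
    return [
        i
        for i, (t, a) in enumerate(zip(text_drift, audio_drift))
        if (t > threshold) != (a > threshold)
    ]


# Toy inputs, invented purely for illustration.
baseline = (0.5, 0.5)
text_route = [(0.5, 0.5), (0.5, 0.1), (0.0, 0.5)]
audio_route = [(0.5, 0.5), (0.5, 0.5), (0.5, 0.9)]

text_d = drift(text_route, baseline)
audio_d = drift(audio_route, baseline)

coupling = pearson(text_d, audio_d)                # cross-route coupling of drift
asymmetric = flag_asymmetric(text_d, audio_d, 0.3)  # -> [1]
```

In this toy case turn 1 is flagged because only the text route moves away from the baseline, which is the kind of turn-level, fault-localizing signal the abstract describes; a low `coupling` value would correspond to weakly coupled routes.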
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: spoken dialogue systems, evaluation and metrics
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4677