Keywords: Spoken Dialogue System, Singing Voice Synthesis, Large Language Models, Speech-to-Singing, Interactive Roleplay
Abstract: With recent advances in automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS) technologies, spoken dialogue systems (SDS) have become widely accessible. However, most existing SDS are limited to conventional spoken responses. We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR–LLM–SVS pipeline, where SVS denotes singing voice synthesis, and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles, so that it can be tailored to different needs in latency, quality, and musical style. SingingSDS is available as a plug-and-play web demo, with modular open-source code that supports customization and extension. The code, video materials, and Hugging Face demo page will be made publicly accessible after acceptance.
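To make the cascaded ASR–LLM–SVS design described in the abstract concrete, the following is a minimal sketch of how three swappable stages might be chained. It is not the SingingSDS implementation: the class name, the stage signatures, the representation of a melody as a list of MIDI pitch numbers, and the stand-in backends are all assumptions introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stage signatures; none of these names come from the SingingSDS code.
ASR = Callable[[bytes], str]             # user audio -> transcript
LLM = Callable[[str, str], str]          # (persona, transcript) -> lyric reply
SVS = Callable[[str, List[int]], bytes]  # (lyrics, melody as MIDI pitches) -> sung audio


@dataclass
class CascadedSingingPipeline:
    """Chains three independent stages so that any one of them can be swapped."""
    asr: ASR
    llm: LLM
    svs: SVS
    persona: str
    melody: List[int]

    def respond(self, audio: bytes) -> Tuple[str, str, bytes]:
        transcript = self.asr(audio)                 # 1. recognize the user's speech
        lyrics = self.llm(self.persona, transcript)  # 2. generate an in-character reply
        singing = self.svs(lyrics, self.melody)      # 3. synthesize the reply as singing
        return transcript, lyrics, singing


# Stand-in backends so the sketch runs without any real models (assumptions).
def dummy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def dummy_llm(persona: str, transcript: str) -> str:
    return f"{persona} sings: {transcript}"


def dummy_svs(lyrics: str, melody: List[int]) -> bytes:
    # One byte per note: a deterministic placeholder, not real audio.
    return bytes(pitch % 256 for pitch in melody)


pipeline = CascadedSingingPipeline(
    asr=dummy_asr, llm=dummy_llm, svs=dummy_svs,
    persona="bard", melody=[60, 62, 64],
)
transcript, lyrics, singing = pipeline.respond(b"hello")
```

Because each stage is only a callable with a fixed signature, replacing a backend, persona, or melody source in this sketch amounts to passing a different argument, which is one plausible way to obtain the kind of configurability the abstract describes.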
Supplementary Material: zip
Submission Number: 9