Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems
Paper Link: https://openreview.net/forum?id=CJCLBc4zVhB
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: We present a method to control the emotional prosody of Text-to-Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising from acoustic conditions and speaker identity. Through thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation on individual sentences toward a more complete evaluation of HCI systems: we present a novel experimental setup in which a TTS system replaces an actor in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted "human touch" to machine dialogue. Audio samples from our experiments and the code are available at: https://emtts.github.io/tts-demo/
Presentation Mode: This paper will be presented in person in Seattle
Copyright Consent Signature (type Name Or NA If Not Transferrable): Saiteja Kosgi
Copyright Consent Name And Address: IIIT Hyderabad, Gachibowli, Hyderabad