Synthesising turn-taking cues using natural conversational data

Published: 15 Jun 2023, Last Modified: 27 Jun 2023
Keywords: conversational TTS, turn-taking, context-aware TTS
TL;DR: To generate prosodic turn-taking cues, we conditioned a FastPitch model on whether the speaker continued talking or gave up their turn.
Abstract: As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood, and many contextual factors can affect an utterance's prosodic realisation. Most studies incorporating context use rich acoustic or textual embeddings of the previous context, then demonstrate improvements in overall naturalness. Such studies are not informative about what the context embedding represents, or how it affects an utterance's realisation. Instead, we narrow the focus to a single, explicit contextual factor: turn-taking. We condition a speech synthesis model on whether an utterance is turn-final. Objective measures and targeted subjective evaluation demonstrate that the model can synthesise turn-taking cues that listeners perceive, with results being speaker-dependent.
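
The abstract does not say how the turn-final label enters the model. One common way to condition a synthesis model on a binary label is to add a per-label embedding vector to every encoder state before prosody prediction. The sketch below illustrates that idea in plain Python; the function names, the embedding size, and the additive scheme are assumptions made for illustration and are not taken from the paper or from any FastPitch implementation.

```python
import random


def make_turn_embedding_table(dim, seed=0):
    """Build a hypothetical lookup table with one vector per label.

    Label 0 = speaker keeps the turn, label 1 = utterance is turn-final.
    In a real model these vectors would be learned; here they are random.
    """
    rng = random.Random(seed)
    return {
        label: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for label in (0, 1)
    }


def condition_on_turn_label(encoder_states, turn_final, table):
    """Add the label embedding to every encoder state (assumed scheme).

    encoder_states: list of per-token vectors (lists of floats).
    turn_final: True if the utterance ends the speaker's turn.
    """
    embedding = table[1 if turn_final else 0]
    return [
        [h + e for h, e in zip(state, embedding)]
        for state in encoder_states
    ]


if __name__ == "__main__":
    # Three dummy token states of size 4 stand in for encoder outputs.
    states = [[0.0] * 4 for _ in range(3)]
    table = make_turn_embedding_table(dim=4)
    held = condition_on_turn_label(states, turn_final=False, table=table)
    yielded = condition_on_turn_label(states, turn_final=True, table=table)
    print(held[0])
    print(yielded[0])
```

Because the label is an explicit input, the same text can be synthesised under both settings and the resulting prosody compared, which is the kind of targeted comparison the abstract describes.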
