Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS

Harm Lameris; Ambika Kirkland; Joakim Gustafson; Eva Szekely

Published: 15 Jun 2023, Last Modified: 16 Jun 2023, SSW12, Readers: Everyone

Keywords: speech synthesis, text to speech, evaluation, social, context

TL;DR: We compare the effect of introducing situational and preceding context in speech synthesis evaluation.

Abstract: Speech synthesis evaluation methods have lagged behind the development of TTS systems, with single-sentence read-speech MOS naturalness evaluation on crowdsourcing platforms being the industry standard. For TTS to be applied successfully in social contexts, evaluation methods need to be socially embedded in the situation where the systems will be deployed. Because in-person interaction evaluations for TTS are constrained by time and cost, we examine the effect of introducing situational context and preceding sentence context to participants in a subjective listening experiment. We conduct a suitability evaluation for a robot game guide that explains game rules to participants using two synthesized spontaneous voices: an instruction-specific voice and a general spontaneous voice. Results indicate that the inclusion of context influences user ratings, highlighting the need for context-aware evaluations. However, the type of context did not significantly affect the results.
