Importance of Human Factors in Text-To-Speech Evaluations

Published: 15 Jun 2023, Last Modified: 27 Jun 2023 · SSW12 · Readers: Everyone
Keywords: Text-to-Speech, Subjective evaluations, MOS, Preference Tests, Comparative Tests, Test Reproducibility
TL;DR: We investigate the variance caused by human factors in text-to-speech evaluations and show that simple improvements in experiment design can significantly improve experiment reproducibility at no extra cost.
Abstract: Both mean opinion score (MOS) evaluations and preference tests in text-to-speech are often associated with high rating variance. In this paper we investigate two important factors that contribute to that variance. The first is the variance introduced by how raters are selected for a specific test; the second is the dynamic behavior of individual raters over time. This paper raises awareness of these issues when designing an evaluation experiment, since the standard test-level confidence interval cannot capture the variance associated with these two factors. We show the impact of the two sources of variance and how they can be mitigated. We demonstrate that simple improvements in experiment design, such as assigning fewer rating tasks per rater, can significantly improve the experiment's confidence intervals and reproducibility at no extra cost.
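To make the rater-selection issue concrete, the sketch below contrasts a naive confidence interval (resampling individual ratings) with a rater-level bootstrap (resampling whole raters), which retains the variance due to which raters happened to be picked. This is an illustrative assumption for exposition, not the paper's method; the rater IDs, scores, and helper names are invented.

```python
import random
import statistics

# Invented example data: ratings[r] holds the 1-5 scores given by rater r.
ratings = {
    "r1": [4, 5, 4, 4], "r2": [3, 3, 2, 3], "r3": [5, 5, 4, 5],
    "r4": [4, 3, 4, 4], "r5": [2, 3, 3, 2], "r6": [4, 4, 5, 4],
}

def bootstrap_ci(samples, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for stat() over a list of samples."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(samples) for _ in samples]) for _ in range(n_boot)
    )
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot) - 1]

# Naive CI: resample individual ratings, ignoring who produced them.
all_scores = [s for scores in ratings.values() for s in scores]
naive = bootstrap_ci(all_scores, statistics.mean)

# Rater-level CI: resample whole raters, keeping rater-selection variance.
def mos_of_raters(raters):
    return statistics.mean(s for r in raters for s in ratings[r])

rater_level = bootstrap_ci(list(ratings), mos_of_raters)

print(f"naive CI:       ({naive[0]:.2f}, {naive[1]:.2f})")
print(f"rater-level CI: ({rater_level[0]:.2f}, {rater_level[1]:.2f})")
```

With rater means this spread out, the rater-level interval comes out noticeably wider than the naive one, which is the gap the abstract refers to when it says the standard test-level confidence interval understates the true variance.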