Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation

Ambika Kirkland; Shivam Mehta; Harm Lameris; Gustav Eje Henter; Eva Szekely; Joakim Gustafson

Published: 15 Jun 2023, Last Modified: 28 Jun 2023, SSW12, Readers: Everyone

Keywords: speech synthesis, TTS evaluation, mean opinion score, text-to-speech, neural TTS

TL;DR: We show that researchers underreport and are inconsistent in how they conduct MOS tests, and that this can affect the results of TTS evaluations.

Abstract: The Mean Opinion Score (MOS) is a prevalent metric in TTS evaluation. Although standards for collecting and reporting MOS exist, researchers seem to use the term inconsistently and underreport the details of their testing methodologies. A survey of Interspeech and SSW papers from 2021-2022 shows that most authors do not report scale labels, increments, or instructions to participants, and those who do diverge in their implementation. It is also unclear in many cases whether listeners were asked to rate naturalness or overall quality. MOS values obtained for natural speech using different testing methodologies vary across the surveyed papers: specifically, quality MOS is on average higher than naturalness MOS. We carried out several listening tests using the same stimuli but with differences in the scale increment and in the instructions about what participants should rate, and found that both of these variables affected MOS for some systems.
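For readers unfamiliar with the metric, a MOS is conventionally the arithmetic mean of listener ratings for a system, often reported with a confidence interval. The sketch below is illustrative only and is not taken from the paper: the function name `mean_opinion_score`, the normal-approximation interval (z = 1.96), and the example ratings are all assumptions made for this example.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Hypothetical helper: the interval uses a normal approximation,
    which is an assumption here, not a method described in the source.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Half-width = z * sample standard deviation / sqrt(number of ratings).
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings on a 1-5 scale with 1-point increments.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Because the score is a plain average, details such as the scale increment and what listeners are asked to rate feed directly into the reported number, which is why the abstract stresses reporting them.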
