Evaluating the Robustness of Speech Evaluation Standards for the Crowd

QoMEX 2022 (modified: 08 Nov 2022)
Abstract: Subjective assessments are a key component of speech quality research. Traditionally, these assessments are conducted in laboratories under controlled conditions, following international standards such as ITU-T Rec. P.800. However, even before the current pandemic, speech quality research increasingly used crowdsourcing-based approaches for collecting subjective ratings. Crowdsourcing allows researchers to collect data without a dedicated test laboratory, to reach a large and diverse group of participants, and to perform the assessment in various real-life settings. Still, this approach raises questions about the reliability and validity of the subjective ratings, especially when comparing them with data collected through standardized procedures. One step toward addressing these challenges was the development of the ITU-T Rec. P.808 standard, which helps practitioners apply best practices from speech quality and crowdsourcing studies in their crowdsourced speech quality assessments. However, even with ITU-T Rec. P.808 in place, it is unclear how much background knowledge is necessary to successfully implement the standard. Therefore, this paper assesses the data quality differences between two P.808 implementations: one by a co-author of the P.808 standard, the other by a researcher with little background in crowdsourcing or speech quality assessment. Both implementations are used in a large-scale crowdsourcing study with about two hundred users from Amazon Mechanical Turk. The collected ratings are compared to gold-standard data from a certified laboratory, and the two implementations are compared to analyze whether they lead to the same conclusions. The results show that both implementations correlate strongly with the laboratory data and with each other, suggesting that ITU-T Rec. P.808 is robust enough to be implemented by non-experts in speech evaluation or crowdsourcing.
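As a rough illustration of the comparison described in the abstract (not the authors' actual analysis pipeline), the sketch below correlates per-condition mean opinion scores (MOS) from two crowdsourced P.808 implementations against a laboratory gold standard. All variable names and MOS values are hypothetical placeholders; the paper does not publish its data here.

```python
import numpy as np
from scipy import stats

# Hypothetical per-condition MOS vectors: each entry is the mean opinion
# score for one test condition, averaged over all raters. These values are
# illustrative only, not the study's results.
mos_lab = np.array([4.2, 3.8, 2.9, 1.7, 3.3, 4.5])      # laboratory gold standard
mos_expert = np.array([4.1, 3.6, 3.0, 1.9, 3.2, 4.4])   # P.808 run by the standard's co-author
mos_novice = np.array([4.0, 3.7, 2.8, 2.0, 3.4, 4.3])   # P.808 run by the non-expert researcher

for name, mos_cs in [("expert", mos_expert), ("novice", mos_novice)]:
    r, p = stats.pearsonr(mos_lab, mos_cs)       # linear agreement with the lab
    rho, _ = stats.spearmanr(mos_lab, mos_cs)    # rank-order agreement with the lab
    print(f"{name}: Pearson r={r:.3f} (p={p:.3g}), Spearman rho={rho:.3f}")

# The two crowdsourced implementations can be compared against each other
# the same way, e.g. stats.pearsonr(mos_expert, mos_novice).
```

High Pearson and Spearman coefficients for both implementations, as well as between the implementations themselves, would correspond to the paper's conclusion that the standard is robust to the implementer's level of expertise.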