Re-examining the quality dimensions of synthetic speechDownload PDF

Published: 15 Jun 2023, Last Modified: 28 Jun 2023SSW12Readers: Everyone
Keywords: speech synthesis evaluation, factor analysis, speech quality
TL;DR: Re-examination of underlying items for evaluating synthetic speech quality with factor analysis and acoustic correlation.
Abstract: The aim of this paper is to generate a more comprehensive framework for evaluating synthetic speech. To this end, a line of tests resulting in an exploratory factor analysis (EFA) have been carried out. The proposed dimensions that encapsulate the construct of “synthetic speech quality” are: “human-likeness”, “audio quality”, “negative emotion”, “dominance”, “positive emotion”, “calmness”, “seniority” and “gender”, with item-to-total correlations pointing towards “gender” being an orthogonal construct. A subsequent analysis on common acoustic features, found in forensic and phonetic literature, reveals very weak correlations with the proposed scales. Inter-rater and inter-item agreement measures additionally reveal low consistency within the scales. We also make the case that there is a need for a more fine grained approach when investigating the quality of synthetic speech systems, and propose a method that attempts to capture individual quality dimensions in the time domain.
3 Replies

Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview