Rethinking Mean Opinion Scores in Speech Quality Assessment: Score Aggregation through Quantized Distribution Fitting
Abstract: This study addresses speech quality assessment (SQA), the task of automatically predicting the subjective quality of a given speech sample. Recent efforts have focused on training neural models to predict the mean opinion score (MOS) of speech samples produced by text-to-speech or voice conversion systems. We aim to improve the performance of these models by replacing MOS with a different score aggregation method. The proposed method mitigates issues that arise because annotators must choose from a limited set of discrete rating options. It assumes that annotators internally hold a continuous score and pick the nearest discrete rating. We model this process by quantizing a latent continuous distribution to approximate the distribution of observed ratings. The latent distribution is estimated by minimizing a loss between its quantized form and the actual ratings, and its peak is used as the target value in place of MOS. Experimental results demonstrate that substituting MOSNet's training target with the proposed value improves prediction performance.
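The aggregation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the latent distribution, the loss, or the optimizer, so everything below is an assumption made for illustration: a Gaussian latent score, a 5-point rating scale, a cross-entropy loss between the observed rating histogram and the quantized distribution, and a plain grid search. The names `quantized_probs`, `cross_entropy`, and `fit_latent_peak` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

LEVELS = (1, 2, 3, 4, 5)  # assumption: 5-point rating scale


def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantized_probs(mu, sigma, levels=LEVELS):
    """Mass each discrete rating receives when a latent continuous
    score is rounded to the nearest level (outer bins are open-ended)."""
    probs = []
    last = len(levels) - 1
    for i in range(len(levels)):
        # Bin edges sit at the midpoints between neighbouring levels.
        lower = 0.0 if i == 0 else normal_cdf(
            (levels[i - 1] + levels[i]) / 2, mu, sigma)
        upper = 1.0 if i == last else normal_cdf(
            (levels[i] + levels[i + 1]) / 2, mu, sigma)
        probs.append(upper - lower)
    return probs


def cross_entropy(ratings, mu, sigma, levels=LEVELS, eps=1e-12):
    """Assumed loss: cross-entropy between the empirical rating
    histogram and the quantized latent distribution."""
    counts = Counter(ratings)
    n = len(ratings)
    q = quantized_probs(mu, sigma, levels)
    return -sum((counts[k] / n) * math.log(max(q[i], eps))
                for i, k in enumerate(levels) if counts[k] > 0)


def fit_latent_peak(ratings, levels=LEVELS):
    """Grid-search (mu, sigma) minimizing the loss. For a Gaussian the
    peak of the latent distribution is mu, which is returned first."""
    best_loss, best_mu, best_sigma = math.inf, None, None
    mu_steps = int(round((levels[-1] - levels[0]) / 0.01))
    for i in range(mu_steps + 1):
        mu = levels[0] + i * 0.01
        for j in range(39):
            sigma = 0.1 + j * 0.05  # lower bound avoids a degenerate fit
            loss = cross_entropy(ratings, mu, sigma, levels)
            if loss < best_loss:
                best_loss, best_mu, best_sigma = loss, mu, sigma
    return best_mu, best_sigma


if __name__ == "__main__":
    ratings = [4, 4, 4, 5, 3, 4, 4, 5]  # made-up ratings for one utterance
    mos = sum(ratings) / len(ratings)
    peak, sigma = fit_latent_peak(ratings)
    print(f"MOS = {mos:.3f}, fitted latent peak = {peak:.2f}, sigma = {sigma:.2f}")
```

Under these assumptions, the fitted peak would take the place of the per-utterance MOS as the regression target. A gradient-based optimizer or a different latent family could be swapped in without changing the overall structure.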