SQAT-LD: SPeech Quality Assessment Transformer Utilizing Listener Dependent Modeling for Zero-Shot Out-of-Domain MOS Prediction

Published: 01 Jan 2023, Last Modified: 24 Jul 2025, Venue: ASRU 2023, License: CC BY-SA 4.0
Abstract: In this paper, we propose SQAT-LD, the speech quality assessment transformer utilizing listener-dependent modeling, a mean opinion score (MOS) prediction system that was submitted to the 2023 VoiceMOS Challenge. The system combines self-supervised learning (SSL) models with listener-dependent modeling. Because the challenge emphasizes challenging, real-world zero-shot out-of-domain MOS prediction across three different voice evaluation scenarios, we designed a two-branch module that predicts a score and a weight for each frame, aiming to achieve better generalization. In the challenge, our system achieved fourth place in Track 1a, second place in Track 1b, and first place in Track 2. We also conducted an ablation study to investigate the effectiveness of the proposed method.
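The abstract does not say how the per-frame scores and weights from the two branches are combined into one utterance-level MOS. As a minimal sketch only, the snippet below assumes a common choice: the weight branch emits one logit per frame, the logits are softmax-normalized, and the utterance score is the weighted sum of the frame scores. The function names `softmax` and `pool_utterance_score` are hypothetical and are not taken from the SQAT-LD system.

```python
import math


def softmax(logits):
    """Normalize a list of logits into weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool_utterance_score(frame_scores, frame_weight_logits):
    """Combine per-frame scores into one utterance-level score.

    Assumption (not confirmed by the abstract): weights are obtained by a
    softmax over per-frame logits, and the result is a weighted sum.
    """
    if not frame_scores:
        raise ValueError("at least one frame is required")
    if len(frame_scores) != len(frame_weight_logits):
        raise ValueError("scores and weight logits must have the same length")
    weights = softmax(frame_weight_logits)
    return sum(w * s for w, s in zip(weights, frame_scores))


if __name__ == "__main__":
    # Hypothetical frame-level outputs of the two branches.
    scores = [3.2, 4.1, 2.7, 3.9]
    weight_logits = [0.1, 1.5, -0.8, 0.9]
    print(pool_utterance_score(scores, weight_logits))
```

Under this assumption, equal weight logits reduce the pooling to a plain average over frames, while a dominant logit lets a single frame determine most of the utterance score.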