Abstract: Speech quality assessment is an important tool for speech-related applications. In this study, we propose QUAL-Net, a non-intrusive model that estimates subjective quality scores of target speech. QUAL-Net combines acoustic features extracted by the large-scale model Whisper with spectral features and time-domain waveform features. It employs a CNN-BiLSTM-Attention architecture and introduces a multi-head attention mechanism into the attention layer to enhance performance. Experimental results demonstrate that Whisper embedding features characterize speech quality more powerfully than other self-supervised learning (SSL) embedding features, and that combining all three types of acoustic features yields the largest performance improvement. The results further show that multi-head attention captures more key information from acoustic features than multiplicative self-attention. We evaluated QUAL-Net on the noisy and enhanced track of the VoiceMOS Challenge 2023. Compared with MOSA-Net and other speech quality assessment models, QUAL-Net achieves significant improvement when trained to estimate subjective quality scores, and it outperforms the top-ranked MOSA-Net+ on all evaluation metrics while using a simpler CNN architecture, which reduces model complexity.
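To make the CNN-BiLSTM-Attention pipeline named in the abstract concrete, the sketch below shows one plausible arrangement of those components in PyTorch: a 1-D CNN over framewise acoustic features, a BiLSTM, multi-head self-attention, and a pooled regression head. All layer sizes, the class name `QualNetSketch`, and the feature dimension are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class QualNetSketch(nn.Module):
    """Illustrative CNN -> BiLSTM -> multi-head attention MOS regressor.

    Hyperparameters are placeholder assumptions; the paper's QUAL-Net
    configuration is not specified here.
    """

    def __init__(self, feat_dim=384, hidden=128, heads=4):
        super().__init__()
        # 1-D convolution over the time axis of the concatenated
        # acoustic features (e.g., Whisper + spectral + waveform).
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        # Multi-head self-attention over BiLSTM outputs (2*hidden dims).
        self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)  # frame-level quality score

    def forward(self, x):  # x: (batch, time, feat_dim)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.bilstm(h)
        h, _ = self.attn(h, h, h)        # multi-head self-attention
        return self.head(h).mean(dim=1)  # pool frames -> utterance score


model = QualNetSketch()
score = model(torch.randn(2, 50, 384))  # two utterances, 50 frames each
print(tuple(score.shape))  # one scalar quality score per utterance
```

Averaging the frame-level scores is one common pooling choice for utterance-level MOS prediction; the actual model may pool differently.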