Improve Singing Quality Prediction Using Self-supervised Transfer Learning and Human Perception Feedback

Published: 01 Jan 2023, Last Modified: 02 Apr 2024, Venue: MMAsia 2023, Readers: Everyone
Abstract: The scarcity of large datasets for singing quality assessment makes it difficult to apply complex deep learning methods. This research presents a method that improves singing quality prediction by learning from subjective human perception feedback through transfer learning with self-supervised learning (SSL) speech models. With the CRNN_PH model as the baseline, the SSL models are integrated in two architectures: one draws features directly from the pre-trained SSL model (CRNN_PH+SSL), and the other uses a weighted sum (WS) of the output features from the different transformer blocks of the SSL model (CRNN_PH+SSL_WS). We conducted comparative experiments on seven pre-trained SSL models, five based on wav2vec 2.0 (W2V2) and two based on HuBERT, which were trained on various datasets. Of these, CRNN_PH+W2V2_base_WS shows the largest improvement in singing quality score prediction, aligning most closely with subjective human perception in terms of correlation coefficients and MSE with respect to the ground truth.
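
The weighted-sum (WS) idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `weighted_layer_sum`, the softmax normalization of the layer weights, and the array shapes are all assumptions made for illustration, since the abstract only states that output features from different transformer blocks are combined by a weighted sum.

```python
import numpy as np


def weighted_layer_sum(layer_outputs, layer_logits):
    """Fuse per-block SSL features into one feature sequence.

    layer_outputs: shape (num_layers, num_frames, feature_dim), one feature
                   sequence per transformer block (hypothetical layout).
    layer_logits:  shape (num_layers,), unnormalized per-block scores that
                   would be learned during training.
    Returns the fused features (num_frames, feature_dim) and the weights.
    """
    layer_outputs = np.asarray(layer_outputs, dtype=np.float64)
    layer_logits = np.asarray(layer_logits, dtype=np.float64)
    if layer_outputs.ndim != 3:
        raise ValueError("layer_outputs must have shape (layers, frames, dim)")
    if layer_logits.shape != (layer_outputs.shape[0],):
        raise ValueError("layer_logits must have one entry per layer")

    # Assumption: softmax keeps the weights positive and summing to 1.
    shifted = layer_logits - layer_logits.max()  # for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()

    # Contract the layer axis: (L,) with (L, T, D) gives (T, D).
    fused = np.tensordot(weights, layer_outputs, axes=1)
    return fused, weights


# Toy example: 3 blocks, 4 frames, 2-dimensional features.
toy_outputs = np.arange(24, dtype=float).reshape(3, 4, 2)
fused, weights = weighted_layer_sum(toy_outputs, np.zeros(3))
```

With all logits equal, the weights are uniform and the fused features reduce to the plain average over blocks; training the logits would let the downstream predictor emphasize whichever blocks carry the most useful information.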