Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech
Keywords: Dysarthria, speech quality assessment, data augmentation, weakly-supervised, contrastive learning
TL;DR: Data augmentation via pseudo-labeling and contrastive pretraining improves cross-domain dysarthric speech quality assessment.
Abstract: Dysarthria is a motor speech disorder caused by neurological impairments, resulting in significant degradation of acoustic and perceptual speech characteristics. Accurate dysarthric speech quality assessment (DSQA) is critical for clinical diagnosis, rehabilitation monitoring, and development of inclusive speech technologies. Currently, DSQA relies on expert evaluation by speech-language pathologists (SLPs), which limits its scalability. Automated DSQA would complement clinical expertise by enabling continuous, objective monitoring.
Non-intrusive speech quality assessment (NI-SQA) models estimate quality directly from the input signal. Models such as DNSMOS and UTMOS succeed in clean and noisy conditions, while SpICE and related work extend this to dysarthric speech using large-scale corpora. However, labeled dysarthric data remains scarce: in the Speech Accessibility Project (SAP), only ~30 of 400+ utterances per speaker are SLP-rated, limiting model robustness---particularly across unseen domains and languages.
To fully leverage the large unlabeled portion of SAP alongside the labeled subset, we propose a three-stage framework. First, a Whisper-large regression model trained on labeled SAP data generates pseudo-labels for the unlabeled samples. Second, pseudo-labeled SAP and LibriSpeech (assigned a healthy label) are used for weakly-supervised contrastive pretraining, improving representation quality and speaker diversity. Third, the pretrained encoder is fine-tuned on labeled SAP data for the regression task. Our pairing strategies for contrastive learning---discrete, continuous, and binary---are illustrated in Figure 1.
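As a minimal sketch, the first two stages can be outlined as follows. All function names, the severity threshold, and the binary pairing rule are illustrative assumptions for exposition; they are not the exact implementation, and the actual pairing uses the discrete, continuous, or binary strategies of Figure 1.

```python
# Hedged sketch of stages 1-2 of the framework (assumed interfaces,
# not the authors' code): pseudo-labeling unlabeled SAP utterances,
# then forming contrastive pairs under the *binary* strategy.

def pseudo_label(regressor, unlabeled_utterances):
    """Stage 1: score unlabeled SAP utterances with a regressor
    trained on the SLP-rated subset (here any callable)."""
    return [(utt, regressor(utt)) for utt in unlabeled_utterances]

def binary_pairs(scored, healthy_utterances, threshold=3.5):
    """Stage 2 (binary strategy, assumed form): utterances on the
    same side of a severity threshold form positive pairs, opposite
    sides form negatives. LibriSpeech utterances enter with a
    'healthy' label, adding speaker diversity to the healthy class.
    The threshold value 3.5 is a placeholder assumption."""
    labeled = [(u, score >= threshold) for u, score in scored]
    labeled += [(u, True) for u in healthy_utterances]  # LibriSpeech as healthy
    positives, negatives = [], []
    for i in range(len(labeled)):
        for j in range(i + 1, len(labeled)):
            (ui, li), (uj, lj) = labeled[i], labeled[j]
            (positives if li == lj else negatives).append((ui, uj))
    return positives, negatives
```

Stage 3 then fine-tunes the contrastively pretrained encoder on the labeled SAP subset with a standard regression head.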
We evaluate on diverse cross-domain dysarthric corpora spanning multiple etiologies and languages: UASpeech, DysArinVox, EasyCall, EWA-DB, and NeuroVoz. Our baseline achieves a speaker-level SRCC of 0.732 on cross-domain sets; the proposed framework improves this to 0.761 while preserving SAP performance (utterance-level SRCC 0.719). Ablations confirm that weak supervision is essential for harmonizing SAP and LibriSpeech, and that LibriSpeech integration substantially improves cross-domain robustness.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 113