Keywords: Foundation model, Stroke, Speech, Clinical Assessment
TL;DR: Our framework, combining Whisper embeddings with linguistic, glottal, and acoustic features, achieves 92.4% accuracy in stroke detection and an N-RMSE of 0.1299 in severity prediction, showing its potential to improve diagnosis and patient outcomes.
Abstract: Post-stroke language impairments affect speech and language production, leading to lexical, semantic, syntactic, and articulatory-prosodic deficits. These disruptions extend from impaired cognitive-motor planning to execution, manifesting as altered vocal fold dynamics that compromise speech fluency and intelligibility. The high-dimensional and multimodal nature of these impairments poses significant challenges to traditional assessment methods, necessitating automated solutions that can capture the heterogeneity of disfluencies. We present a multimodal framework that integrates foundation model embeddings with clinically guided features for robust speech assessment. Leveraging our purpose-built database of 600 post-stroke patients, we fine-tune Whisper to extract encoder embeddings that capture pathological speech characteristics.
These representations are integrated with linguistic complexity metrics, physiological glottal parameters, and acoustic features through neural networks. Our model achieves 92.4% classification accuracy in stroke detection, outperforming feature-based methods, with SHAP analysis validating modality-specific feature importance. We further demonstrate real-world clinical utility through severity prediction on Comprehensive Aphasia Test (CAT) scores, achieving an N-RMSE of 0.1299. This framework establishes a clinically relevant approach for integrating speech representations with domain-specific biomarkers to potentially support diagnosis, severity tracking, and precision rehabilitation strategies.
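As a rough illustration of the fusion architecture the abstract describes, the sketch below mean-pools fine-tuned Whisper encoder embeddings and concatenates them with a clinical feature vector (linguistic, glottal, and acoustic measures) before shared layers feed a stroke-detection head and a CAT severity head. The checkpoint name, feature dimension, pooling choice, and head sizes are all assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class MultimodalStrokeAssessor(nn.Module):
    """Fuses pooled Whisper encoder embeddings with clinical features
    for joint stroke detection and severity regression (illustrative)."""

    def __init__(self, whisper_name: str = "openai/whisper-base",
                 clinical_dim: int = 64):
        super().__init__()
        # The paper uses a Whisper model fine-tuned on post-stroke speech;
        # a public base checkpoint stands in here.
        self.whisper = WhisperModel.from_pretrained(whisper_name)
        d_model = self.whisper.config.d_model  # encoder hidden size

        # Shared fusion layers over [speech embedding ; clinical features]
        self.fusion = nn.Sequential(
            nn.Linear(d_model + clinical_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        self.detection_head = nn.Linear(256, 2)  # stroke vs. control logits
        self.severity_head = nn.Linear(256, 1)   # normalized CAT score

    def forward(self, input_features: torch.Tensor, clinical: torch.Tensor):
        # input_features: (batch, n_mels, frames) log-Mel spectrograms,
        # e.g. from WhisperFeatureExtractor; clinical: (batch, clinical_dim)
        encoder_out = self.whisper.encoder(input_features).last_hidden_state
        speech_emb = encoder_out.mean(dim=1)  # mean-pool frame embeddings
        fused = self.fusion(torch.cat([speech_emb, clinical], dim=-1))
        return self.detection_head(fused), self.severity_head(fused)
```

In use, a WhisperFeatureExtractor would convert 16 kHz audio into `input_features`, with the two heads trained under cross-entropy and MSE losses respectively; mean pooling is just one reasonable way to summarize the encoder's frame-level output into a single utterance embedding.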
Submission Number: 54