Abstract: Dysarthria in patients post-stroke is often accompanied by central facial paralysis, which impairs facial motor control and emotional expression. Current assessments rely on acoustic modalities, overlooking facial pathological cues and their correlation with emotional expression, which hinders comprehensive disease assessment. To address this issue, we propose a multimodal severity classification framework that integrates facial and acoustic features. Firstly, a multi-level annotation algorithm based on a pre-trained model and motion amplitude was designed to overcome the problem of data scarcity. Secondly, facial topology was modeled using Delaunay triangulation, with spatial relationships captured via graph convolutional networks (GCNs), while abnormal muscle coordination is quantified using facial action units (AUs). Finally, we proposed a multimodal feature set fusion technology framework to achieve the compensation of facial visual features for acoustic modalities and the analysis of disease classification. Our experimental results using the THE-POSSD dataset demonstrate an accuracy of 92.0% and an F1 score of 91.6%, significantly outperforming single-modality baselines. This study reveals the changes in facial movements and sensitive areas of patients under different emotional states, verifies the compensatory ability of visual patterns for auditory patterns, and demonstrates the potential of this multimodal framework for objective assessment and future clinical applications in speech disorders.
External IDs:doi:10.3390/s26041239
Loading