Abstract: Multimodal medical data, such as brain scans and non-imaging clinical records like demographics and neuropsychological examinations, play an important role in diagnosing neurodegenerative disorders, e.g., Alzheimer's disease (AD) and Parkinson's disease (PD). However, the disease-relevant information is overwhelmed by the high-dimensional image scans and the massive non-imaging data, making it challenging to fuse multimodal medical inputs efficiently. Recent multimodal learning methods adopt deep encoders to extract features and simple concatenation or alignment techniques for feature fusion, which suffer from representation degeneration due to the vast amount of irrelevant information. To address this challenge, we propose a deep self-weighted multimodal relevance weighting approach, which leverages clustering-based contrastive learning to eliminate intra- and inter-modal irrelevancy. The learned relevance score is integrated as a gate with a multimodal attention transformer to provide improved fusion for the final diagnosis. Our proposed model, called SMART (Self-weighted Multimodal Attention-and-Relevance gated Transformer), is extensively evaluated on three public AD/PD datasets and achieves state-of-the-art (SOTA) performance in the diagnosis of neurodegenerative disorders. Our source code will be made available.
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: (1) We propose a novel framework SMART for multimodal neurodegenerative disorder diagnosis. Extensive experiments on three public benchmark datasets for neurodegenerative disorders like AD and PD demonstrate the superiority of our approach over ten baselines, including previous SOTA methods.
(2) We propose a self-weighted multimodal representation learning technique, SMRW, which adopts self-supervised two-level contrastive learning to automatically cluster and weight relevant information at both the intra-modal and inter-modal levels. A follow-up relevance-gated attention module enables efficient multimodal feature fusion for the final prediction.
(3) Thanks to the relevance score learned by SMRW, our model is explainable to some extent while maintaining high diagnostic accuracy. Moreover, our model is theoretically designed for an arbitrary number of modalities and could incorporate additional modalities, such as audio, to fully leverage all available medical information.
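To make the fusion mechanism concrete, the relevance-gated attention idea described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and projection names (`w_rel`, `w_q`, `w_k`, `w_v` are hypothetical), not the paper's actual implementation: each modality token receives a learned relevance score in (0, 1) that gates its contribution before attention-based fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relevance_gated_attention(tokens, w_rel, w_q, w_k, w_v):
    """Gate each modality token by a relevance score, then fuse the
    gated tokens with scaled dot-product attention (illustrative only)."""
    # Relevance score per token: sigmoid of a linear projection.
    rel = 1.0 / (1.0 + np.exp(-(tokens @ w_rel)))   # shape (n_tokens, 1)
    gated = rel * tokens                            # down-weight irrelevant tokens
    q, k, v = gated @ w_q, gated @ w_k, gated @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v, rel

# Toy example: 5 feature tokens (e.g. imaging + non-imaging) of dimension 8.
rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))
fused, rel = relevance_gated_attention(
    tokens,
    rng.normal(size=(d, 1)),
    rng.normal(size=(d, d)),
    rng.normal(size=(d, d)),
    rng.normal(size=(d, d)),
)
print(fused.shape, rel.shape)
```

In the full model, the relevance scores would come from the SMRW contrastive-clustering objective rather than a random projection; the sketch only shows how such scores act as a gate on the attention inputs.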
Submission Number: 2710