Keywords: Medical Imaging, Multimodal Fusion, Attention, Deep Learning
Abstract: Multimodal fusion learning (MFL), a paradigm for jointly learning from heterogeneous data sources, has shown great potential across fields such as medicine, science, and engineering. It is especially desirable in the medical domain, where disparate data modalities such as imaging, clinical records, and omics must be combined. However, existing MFL strategies face several major challenges. First, they struggle to capture complex cross-modal interactions effectively. Second, they are often designed and evaluated for narrow, fixed modality configurations (e.g., imaging only, or specific pairs such as image and omics or image and clinical text), which limits evidence of their adaptability and generalizability to broader collections of heterogeneous medical modalities. Finally, they incur high computational costs, restricting their applicability in resource-constrained healthcare AI. To address these challenges, we propose the Efficient Hybrid-fusion Physics-inspired Attention Learning Network (EHPAL-Net), a novel, lightweight, and scalable MFL framework that integrates diverse modalities through Efficient Hybrid Fusion (EHF) layers. Each EHF layer captures rich modality-specific multi-scale spatial information, followed by a Physics-inspired Cross-modal Fusion Attention module that models fine-grained, structure-preserving cross-modal interactions, thereby learning robust, complementary shared representations. Furthermore, EHF layers are learned sequentially for each modality, making the framework adaptable and generalizable. Extensive evaluations on 15 public datasets show that EHPAL-Net outperforms leading multimodal fusion methods, improving performance by up to 3.97% and lowering computational costs by up to 87.8%, enabling more effective and reliable predictions.
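As an illustration of the design described above, the following is a minimal sketch of one EHF layer. The abstract does not specify the implementation, so every detail here is an assumption: the class name `EHFLayerSketch`, the choice of depthwise convolutions at kernel sizes 3/5/7 for the multi-scale stage, and the use of standard multi-head cross-attention as a stand-in for the Physics-inspired Cross-modal Fusion Attention module are all hypothetical, not the authors' method.

```python
# Hypothetical sketch of one EHF layer. Names, kernel sizes, and the plain
# cross-attention stand-in are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class EHFLayerSketch(nn.Module):
    """Assumed design of one Efficient Hybrid Fusion layer for a single
    modality: multi-scale depthwise convolutions extract modality-specific
    spatial features, then cross-attention fuses them into a running
    shared representation."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Parallel depthwise convolutions at several scales (assumed kernels).
        self.scales = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2, groups=dim)
             for k in (3, 5, 7)]
        )
        self.mix = nn.Linear(3 * dim, dim)  # merge the multi-scale branches
        # Stand-in for the Physics-inspired Cross-modal Fusion Attention:
        # ordinary multi-head cross-attention, queries from the shared state.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, shared: torch.Tensor) -> torch.Tensor:
        # x, shared: (batch, tokens, dim)
        h = x.transpose(1, 2)                      # (batch, dim, tokens) for Conv1d
        h = torch.cat([s(h) for s in self.scales], dim=1).transpose(1, 2)
        h = self.mix(h)                            # modality-specific multi-scale features
        fused, _ = self.cross_attn(shared, h, h)   # shared state attends to this modality
        return self.norm(shared + fused)           # residual update of the shared state

# Usage: fold each modality into the shared representation one layer at a
# time, mirroring the sequential per-modality learning the abstract mentions.
dim = 64
modalities = [torch.randn(8, 16, dim), torch.randn(8, 16, dim)]
layers = [EHFLayerSketch(dim) for _ in modalities]  # one EHF layer per modality
shared = torch.zeros(8, 16, dim)                    # initial shared tokens
for layer, modality in zip(layers, modalities):
    shared = layer(modality, shared)
```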
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20215