$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-aware Mixture-of-Experts

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
TL;DR: We propose a Mixture-of-Experts framework for modeling multimodal interactions in a data-driven and interpretable way.
Abstract: Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation at both the local and global level. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns importance scores to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
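To make the two components described in the abstract concrete, here is a minimal, hedged sketch of an interaction-aware MoE in PyTorch: several "interaction experts" each fuse two modalities, and a reweighting network produces per-sample importance scores over the experts. All class names, dimensions, and the simple concatenation-based fusion are illustrative assumptions, not the authors' implementation or the weakly supervised interaction losses; see the official repository at https://github.com/Raina-Xin/I2MoE for the actual code.

```python
# Illustrative sketch only (assumed architecture, not the official I2MoE code).
import torch
import torch.nn as nn


class InteractionExpert(nn.Module):
    """One expert that fuses two modality embeddings into a prediction."""

    def __init__(self, dim_a: int, dim_b: int, hidden: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, xa: torch.Tensor, xb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([xa, xb], dim=-1))


class SimpleInteractionMoE(nn.Module):
    """Interaction-aware MoE: experts cover different interaction types, and a
    reweighting model assigns sample-level importance scores to each expert."""

    def __init__(self, dim_a: int, dim_b: int, hidden: int,
                 num_classes: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            InteractionExpert(dim_a, dim_b, hidden, num_classes)
            for _ in range(num_experts)
        )
        # Reweighting model: maps the concatenated modalities to a softmax
        # distribution over experts (the sample-level importance scores).
        self.reweight = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_experts),
            nn.Softmax(dim=-1),
        )

    def forward(self, xa: torch.Tensor, xb: torch.Tensor):
        expert_logits = torch.stack(
            [expert(xa, xb) for expert in self.experts], dim=1
        )  # (batch, num_experts, num_classes)
        weights = self.reweight(torch.cat([xa, xb], dim=-1))  # (batch, num_experts)
        # Weighted combination of expert outputs; averaging `weights` over a
        # dataset would give a dataset-level view of which interactions dominate.
        fused = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
        return fused, weights


if __name__ == "__main__":
    model = SimpleInteractionMoE(dim_a=32, dim_b=16, hidden=64, num_classes=3)
    xa, xb = torch.randn(8, 32), torch.randn(8, 16)
    logits, importance = model(xa, xb)
    print(logits.shape, importance.shape)  # torch.Size([8, 3]) torch.Size([8, 4])
```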
Lay Summary: Modern artificial intelligence often works with data from multiple sources, like combining medical images, lab results, and patient records to help doctors make better decisions. But today’s AI models usually integrate this information in a “black box” way: they spit out a final answer, but they do not tell us how different pieces of information interact or which ones matter most. We developed a new system called $\texttt{I$^2$MoE}$ (Interpretable Multimodal Interaction-aware Mixture of Experts) that not only improves how AI combines information from different sources, but also explains what’s going on under the hood. Our model uses specialized “experts” that focus on different types of interactions between data sources, such as how lab results and imaging together affect the diagnosis. It then assigns scores to show which expert matters most for each patient’s diagnosis. We tested $\texttt{I$^2$MoE}$ on both medical and general datasets and found that it improves performance across tasks. More importantly, it helps researchers and practitioners understand the decision-making process involving multiple data sources, making AI systems more transparent and trustworthy.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Raina-Xin/I2MoE
Primary Area: Deep Learning->Other Representation Learning
Keywords: Multimodal Learning, Mixture of Experts, Biomedical Analysis
Submission Number: 8884