Abstract: Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. The code and pretrained models will be released on GitHub.
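The abstract's mechanism can be summarized in a minimal sketch. Assuming a PyTorch-style two-modality setup, a learnable proxy token cross-attends only to the available modality's tokens to stand in for the missing modality's class token, and a joint objective combines the task loss with a symmetric alignment term (CMPT_1 with CLS_2 and vice versa, per Figure 1). The MSE alignment, the detached targets, and all names here are illustrative assumptions, not the paper's exact formulation; λ = 0.20 follows the ablation in Section 4.6.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalProxyToken(nn.Module):
    """Sketch: a learnable proxy token that attends only to the tokens of
    the available modality to approximate the missing modality's class
    token (no generation network, no auxiliary encoder)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.proxy = nn.Parameter(torch.zeros(1, 1, dim))  # the CMPT query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) from the available modality's frozen encoder
        q = self.proxy.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # cross-attention only
        return out.squeeze(1)                  # (B, D) proxy class token

def joint_loss(logits, labels, cmpt_1, cls_2, cmpt_2, cls_1, lam=0.20):
    """Task loss plus symmetric alignment: CMPT_1 <-> CLS_2 and
    CMPT_2 <-> CLS_1. MSE and detached targets are placeholder choices."""
    task = F.cross_entropy(logits, labels)
    align = F.mse_loss(cmpt_1, cls_2.detach()) + F.mse_loss(cmpt_2, cls_1.detach())
    return task + lam * align
```

Under these assumptions, at inference a missing modality's class token would simply be replaced by the proxy computed from the modality that is present, before fusion and classification.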
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
* **Added Qualitative Analysis of CMPT Effectiveness:** Included t-SNE visualizations of fused features (**Figures 5 and 10**) and attention map visualizations for the CLS tokens and CMPTs (**Figures 6 and 11**) to analyze model behavior with and without CMPTs under missing-modality scenarios, as discussed in **Section 4.6.4** and **Appendix A8**.
* **Added Ablation Study on Alignment Loss Weight λ:** Conducted an ablation study on the alignment loss weight λ across two datasets to justify the choice of λ = 0.20, presented in **Section 4.6.3** with results shown in **Table 5**.
* **Added Ablation Study on LoRA Rank:** Performed an ablation study on LoRA rank r across four datasets, demonstrating that rank 1 offers a good trade-off between performance and efficiency. Details are provided in **Section A4** with results in **Table 7** (a minimal rank-1 LoRA sketch follows this list).
* **Enhanced Failure Mode Discussion:**
* Revised **Section 4.6.2** and **Appendix A7** to more explicitly address failure cases.
* Added qualitative analysis of failure modes in **Section A8**, including t-SNE plots (**Figure 10**) and attention map visualizations (**Figure 11**).
* **Figure and Textual Revisions:**
* Updated **Figure 1** with color-coded arrows to more clearly depict the alignment of $CMPT_1$ with $CLS_2$, and vice versa.
* Clarified that our work focuses on classification tasks in **Section 3.4**, prior to Equation 7.
* Acknowledged Kim & Kim’s comparable performance in **Section 4.3.1**.
* Removed redundant text in **Section 4.3.2**.
* Fixed incorrect value in **Table 2** for UPMC Food-101 dataset (image-missing scenario): corrected from 80.66 to 85.31.
    * Discussed the MoRA paper [A] in the related work section.
* Made minor textual revisions throughout the manuscript to improve clarity and address reviewer feedback.
[A] MoRA: LoRA Guided Multi-Modal Disease Diagnosis with Missing Modality, MICCAI 2024
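As referenced in the LoRA-rank ablation above, the low-rank adapters can be illustrated with a minimal sketch of a rank-1 adapter on a frozen linear layer. Wrapping `nn.Linear`, the scaling `alpha / r`, and the initialization scheme are generic LoRA conventions assumed here, not necessarily the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (here r=1,
    the setting reported as a good performance/efficiency trade-off)."""
    def __init__(self, base: nn.Linear, r: int = 1, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the unimodal encoder frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x, with only A and B trained
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

A typical use would wrap the attention projections of the frozen encoder, e.g. `block.attn.qkv = LoRALinear(block.attn.qkv, r=1)`; which layers receive adapters is an assumption here, not a detail taken from the paper.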
Assigned Action Editor: ~Anurag_Arnab1
Submission Number: 4996