SelfMask: Cross-modal Self-Masking for Multimodal Representation Learning in Missing Modality Scenarios

ICLR 2026 Conference Submission 25120 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal learning, Missing modality, Self-supervised learning, Representation-level imputation, Cross-modal masking
TL;DR: SelfMask improves robustness under missing-modality inputs by learning representation-level imputation and a context-aware masking policy, trained with cycle-consistent self-supervision.
Abstract: Multimodal learning promises to harness complementary information across diverse modalities, yet real-world deployments often face missing modalities due to acquisition costs, privacy constraints, or data corruption, leading to substantial performance degradation. We present SelfMask, a framework for learning robust representations in the presence of incomplete multimodal data. During training, SelfMask imputes missing modality representations through a masked representation learning scheme with adaptive masking, where informative masks are learned from data rather than sampled at random. To guide the imputation without relying on ground truth for missing modalities, which is unavailable by definition, we introduce a cross-modal consistency loss: predicted representations of missing modalities are required not only to align with semantic content but also to support the reconstruction of observed ones. This consistency-based objective encourages robust, semantically grounded representations. Experiments on MIMIC-IV and CMU-MOSEI demonstrate that SelfMask consistently improves resilience and predictive accuracy under diverse missing-modality scenarios. Ablation studies further show that our learned masks outperform conventional random masking, yielding more reliable cross-modal representations. Our framework is broadly applicable across multimodal domains, offering a practical solution for real-world settings where incomplete modalities are the norm.
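
The sketch below illustrates one way a cycle-consistent cross-modal imputation objective of this kind could look in a PyTorch-style setup. It is not the authors' released code: the module names, dimensions, loss form (MSE), and the random stand-in for the learned masking policy are all illustrative assumptions. The two terms mirror the abstract's description: an alignment term for the imputed representation and a cycle term requiring it to support reconstruction of the observed modality.

```python
# Illustrative sketch only (assumed PyTorch setup, not the authors' implementation):
# a cycle-consistent cross-modal imputation loss for two modalities A and B whose
# encoders already produce fixed-size representations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalImputer(nn.Module):
    """Predicts the representation of one modality from the other."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.a_to_b = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.b_to_a = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z_a, z_b):
        return self.a_to_b(z_a), self.b_to_a(z_b)


def cycle_consistency_loss(imputer, z_a, z_b, b_missing):
    """Cycle-consistent objective under a simulated missing-modality mask.

    b_missing: boolean tensor of shape (batch,), True where modality B is
    treated as missing during training (e.g. drawn by a masking policy).
    """
    z_b_hat, _ = imputer(z_a, z_b)        # impute B's representation from A
    _, z_a_cycle = imputer(z_a, z_b_hat)  # reconstruct A back from the imputed B

    # 1) Alignment term: where B is actually observed, the imputed
    #    representation should match the true one.
    if (~b_missing).any():
        align = F.mse_loss(z_b_hat[~b_missing], z_b[~b_missing])
    else:
        align = z_a.new_zeros(())

    # 2) Cycle term: even where B is missing, the imputed representation
    #    must still support reconstructing the observed modality A.
    cycle = F.mse_loss(z_a_cycle, z_a)

    return align + cycle


if __name__ == "__main__":
    # Random features stand in for encoder outputs; random masking stands in
    # for the learned masking policy described in the abstract.
    imputer = CrossModalImputer(dim=256)
    z_a, z_b = torch.randn(8, 256), torch.randn(8, 256)
    b_missing = torch.rand(8) < 0.5
    loss = cycle_consistency_loss(imputer, z_a, z_b, b_missing)
    loss.backward()
    print(float(loss))
```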
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 25120