Preventing Representation Collapse in Latent Prediction via Context-Conditional Alignment under Missing Modalities

Published: 02 Mar 2026, Last Modified: 14 May 2026 · ICLR 2026 Re-Align Workshop · Readers: Everyone · License: CC BY 4.0
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: Multimodal models deployed in real-world settings often suffer from missing data modalities due to acquisition costs, privacy constraints, or sensor failures, leading to severe performance degradation. Existing approaches based on shared representations or expert models struggle when modality-specific information is absent. A key challenge is that, when a modality is unobserved, the target representation is not uniquely identifiable, making naive latent prediction prone to degenerate or collapsed solutions. We propose Cross-modal Embedding Prediction and Alignment (CEPA), a framework that addresses missing modalities via adaptive generative imputation directly in latent space. CEPA employs masked representation learning with data-driven masking patterns and a context-conditional distribution alignment objective that stabilizes latent prediction and reconstructability of observed modalities. Experiments on the heterogeneous MIMIC benchmark—spanning EHR time-series, chest X-rays, and clinical notes—show that CEPA consistently outperforms prior methods across four prediction tasks, with ablations confirming the importance of adaptive masking and alignment.
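Illustrative note (not from the paper): the abstract's claim that an unidentifiable target makes naive latent prediction prone to degenerate solutions can be seen in a toy setting. The sketch below is not CEPA and makes strong simplifying assumptions: a hypothetical 1-D "context" latent, a hypothetical "target" latent that is +1 or -1 independently of the context, and a plain least-squares predictor standing in for "naive latent prediction". Under these assumptions the best regression output is close to the mean of the target, so the predicted latents have almost no variance even though the true latents do.

```python
import random
import statistics


def fit_least_squares(xs, ys):
    """Closed-form 1-D least squares; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(n=5000, seed=0):
    """Return (variance of true target latents, variance of predicted latents)."""
    rng = random.Random(seed)
    # Hypothetical context latent from the observed modality.
    context = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical target latent of the missing modality: +1 or -1,
    # drawn independently of the context, so it is not identifiable from it.
    target = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    # Naive latent prediction: regress the target latent on the context.
    slope, intercept = fit_least_squares(context, target)
    predicted = [slope * c + intercept for c in context]
    return statistics.pvariance(target), statistics.pvariance(predicted)


target_var, pred_var = simulate()
print(f"target variance:    {target_var:.4f}")
print(f"predicted variance: {pred_var:.4f}")
```

The predicted variance comes out far smaller than the target variance, which is the sense of "collapse" in this toy case. How CEPA's context-conditional alignment objective counters this is specified in the paper itself, not here.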
Presenter: ~Jungwon_Choi1
Submission Number: 111