Keywords: Multi-modal image fusion, Feature disentanglement, Orthogonal feature decomposition, Gram–Schmidt reparameterization, Multi-granularity contrastive learning, Cross-task generalization
Abstract: Multi-modal image fusion suffers from feature entanglement, where modality-specific, content-specific, and task-specific information becomes conflated in unified representation spaces, leading to suboptimal fusion quality and limited generalization. This paper proposes Cross-Modal Feature Disentanglement with Contrastive Task Alignment (CMD-CTA), a framework that addresses this challenge through mathematically motivated feature separation and semantic alignment, supported by both theoretical analysis under idealized assumptions and empirical evidence on real-world fusion benchmarks. The approach introduces two key innovations: (1) differentiable orthogonal feature decomposition that encourages separation into content, modality, and task subspaces under information-theoretic sufficiency constraints; and (2) contrastive task alignment that establishes semantic bridges through learnable prototypes and multi-granularity contrastive learning. We further adopt a hybrid Vision Mamba–Swin backbone to couple linear-complexity long-range modeling with windowed locality, thereby reducing parameters while preserving context. Extensive experiments across six fusion tasks and downstream object detection demonstrate 5.8–7.3% improvements over state-of-the-art methods, 6.1% higher mAP@0.5, and a 15.7× improvement in parameter efficiency. This empirically validated framework for representation learning in multi-modal fusion has broad implications for computer vision and autonomous systems.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2630
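
The orthogonal decomposition named in the keywords and abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `gram_schmidt` and `decompose`, the feature dimension, and the subspace sizes `(6, 3, 3)` are all assumptions chosen for illustration, and the abstract's differentiable, learned version would live inside a network rather than in plain NumPy. The sketch only shows the core idea that an unconstrained matrix can be reparameterized into an orthonormal basis via Gram–Schmidt, whose columns are then partitioned into content, modality, and task subspaces.

```python
import numpy as np

def gram_schmidt(raw: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Orthonormalize the columns of `raw` (d x k, k <= d) with modified Gram-Schmidt."""
    d, k = raw.shape
    basis = np.zeros((d, k))
    for j in range(k):
        v = raw[:, j].copy()
        # Remove the components along every previously accepted direction.
        for i in range(j):
            v -= (basis[:, i] @ v) * basis[:, i]
        basis[:, j] = v / (np.linalg.norm(v) + eps)
    return basis

def decompose(features: np.ndarray, raw: np.ndarray, sizes: tuple) -> list:
    """Project features (n x d) onto mutually orthogonal subspaces.

    `raw` is an unconstrained matrix (hypothetical stand-in for learned weights);
    `sizes` gives the number of basis vectors assigned to each subspace.
    """
    basis = gram_schmidt(raw)
    parts, start = [], 0
    for size in sizes:
        sub = basis[:, start:start + size]    # d x size orthonormal block
        parts.append(features @ sub @ sub.T)  # orthogonal projection onto its span
        start += size
    return parts

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 12))  # 4 samples, 12-dim features (assumed sizes)
raw = rng.normal(size=(12, 12))      # unconstrained parameters before reparameterization

content, modality, task = decompose(features, raw, sizes=(6, 3, 3))
```

Because the three column blocks come from one orthonormal basis, components from different subspaces have zero inner product, and when the block sizes sum to the feature dimension the three components add back up to the original features.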