OGP-Net: Optical Guidance Meets Pixel-Level Contrastive Distillation for Robust Multi-Modal and Missing Modality Segmentation
Abstract: Enhancing the performance of semantic segmentation models with multi-spectral images (RGB-IR) is crucial, particularly in low-light and adverse environments. While multi-modal fusion techniques aim to learn cross-modality features for generating fused images or perform knowledge distillation, they often treat multi-modal and missing-modality scenarios as separate challenges, which is suboptimal. To address this, a novel multi-modal fusion approach called the Optically-Guided Pixel-level contrastive learning Network (OGP-Net) is proposed, which uses Distillation with Multi-View Contrastive (DMC) and Distillation for Uni-modal Retention (DUR) to maintain the correlation between modality-shared and modality-specific features. DMC aligns the uni-modal features by projecting the semantic information across modalities into a unified latent space, ensuring that the feature maps retain multi-modal representations. Pixel-level multi-view contrastive learning is introduced to enable modality-invariant representation learning. To retain modality-specific information, DUR is proposed, which distills detailed textures from RGB images into the optical branch of OGP-Net. Additionally, the Gated Spectral Unit (GSU) is integrated into the framework to eliminate the need for manual tuning and avoid forced feature alignment. Comprehensive experiments show that OGP-Net outperforms state-of-the-art models in multi-modal and missing-modality scenarios across three public benchmark datasets. It also converges faster and learns efficiently from limited training samples.
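The abstract does not specify how the pixel-level multi-view contrastive objective in DMC is implemented; the snippet below is a minimal, hypothetical PyTorch sketch of one such loss, assuming RGB and IR feature maps of identical spatial size, a shared 1x1-convolution projection head, and an InfoNCE-style objective in which corresponding pixel locations across modalities form positive pairs and all other locations act as negatives. The class name `PixelContrastiveLoss`, the temperature value, and the projection dimensions are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a pixel-level multi-view (RGB <-> IR) contrastive loss.
# Assumptions (not from the paper): corresponding pixel locations across modalities
# are positives, all other pixels in the batch are negatives, and pixel embeddings
# are L2-normalized before the InfoNCE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelContrastiveLoss(nn.Module):
    def __init__(self, dim_in: int = 256, dim_proj: int = 128, temperature: float = 0.1):
        super().__init__()
        # 1x1 conv projection head shared by both modalities (an assumption).
        self.proj = nn.Conv2d(dim_in, dim_proj, kernel_size=1)
        self.temperature = temperature

    def forward(self, feat_rgb: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        # feat_rgb, feat_ir: (B, C, H, W) feature maps from the two branches.
        z_rgb = F.normalize(self.proj(feat_rgb), dim=1)  # (B, D, H, W)
        z_ir = F.normalize(self.proj(feat_ir), dim=1)

        b, d, h, w = z_rgb.shape
        z_rgb = z_rgb.flatten(2).transpose(1, 2).reshape(-1, d)  # (B*H*W, D)
        z_ir = z_ir.flatten(2).transpose(1, 2).reshape(-1, d)

        # Similarity of every RGB pixel embedding to every IR pixel embedding.
        logits = z_rgb @ z_ir.t() / self.temperature  # (N, N) with N = B*H*W
        # Positive pair = the same pixel location viewed in the other modality.
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE over both matching directions.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


# Usage sketch: applying the loss on downsampled feature maps keeps the
# N x N similarity matrix small enough for practical training.
if __name__ == "__main__":
    loss_fn = PixelContrastiveLoss(dim_in=256)
    rgb_feat = torch.randn(2, 256, 16, 16)
    ir_feat = torch.randn(2, 256, 16, 16)
    print(loss_fn(rgb_feat, ir_feat).item())
```

In this reading, the cross-modal positives push pixel embeddings from the RGB and IR branches toward a modality-invariant representation, which is consistent with the role the abstract assigns to DMC; the actual formulation used by OGP-Net may differ in its sampling of positives and negatives.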