Keywords: Semantic segmentation, multimodal fusion
Abstract: Multimodal semantic segmentation augments RGB imagery with an auxiliary sensing stream X (RGB+X), e.g., thermal, event, LiDAR, polarization, or light field, to improve robustness under adverse illumination and motion blur. We target two coupled bottlenecks in RGB+X segmentation: selecting the most predictive modality at each location and aligning semantics across modalities. The proposed framework performs token-wise auxiliary selection to activate a single, reliable auxiliary stream per token, and applies style-consistent, polarity-aware cross-modality fusion that transfers auxiliary appearance statistics to RGB features while preserving both supportive and contradictory evidence. We evaluate on five modality pairings (RGB+Thermal, RGB+Event, RGB+LiDAR, RGB+Polarization, and RGB+Light Field) and achieve new state-of-the-art results on each. Representative results include 76.89% mIoU on MFNet (RGB+Thermal) and 52.54% mIoU on MCubeS (RGB+A+D+N), among other modality combinations, surpassing recent fusion frameworks under comparable backbones and training protocols. Overall, this selective, alignment-aware fusion design provides a robust path to better RGB+X segmentation without sacrificing efficiency.
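The abstract describes two mechanisms: token-wise selection of a single auxiliary stream per token, and a style-transfer-like fusion that moves auxiliary appearance statistics onto RGB features. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of those two ideas, assuming a hard Gumbel-softmax gate for per-token selection and an AdaIN-style scale/shift for statistic transfer. All module and variable names (e.g., `TokenwiseAuxSelectFuse`, `selector`, `to_scale_shift`) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of token-wise auxiliary
# selection plus AdaIN-style transfer of auxiliary statistics to RGB tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenwiseAuxSelectFuse(nn.Module):
    """Pick one auxiliary modality per token, then fuse it into RGB tokens."""

    def __init__(self, dim: int, num_aux: int):
        super().__init__()
        # Scores how predictive each auxiliary stream is for every token.
        self.selector = nn.Linear(dim * (num_aux + 1), num_aux)
        # Predicts per-token scale/shift from the selected auxiliary token
        # (a lightweight stand-in for "style-consistent" statistic transfer).
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # rgb: (B, N, C) RGB tokens; aux: (B, M, N, C) tokens from M auxiliary streams.
        B, M, N, C = aux.shape
        # Token-wise selection: a hard (one-hot) choice via straight-through
        # Gumbel-softmax so only a single auxiliary stream is active per token.
        sel_in = torch.cat([rgb, aux.permute(0, 2, 1, 3).reshape(B, N, M * C)], dim=-1)
        logits = self.selector(sel_in)                          # (B, N, M)
        weights = F.gumbel_softmax(logits, tau=1.0, hard=True)  # one-hot over M
        chosen = torch.einsum("bnm,bmnc->bnc", weights, aux)    # (B, N, C)
        # Statistic transfer: re-scale normalized RGB tokens with scale/shift
        # predicted from the chosen auxiliary tokens (AdaIN-like).
        scale, shift = self.to_scale_shift(chosen).chunk(2, dim=-1)
        fused = self.norm(rgb) * (1.0 + scale) + shift
        # Residual keeps the original RGB evidence alongside the transferred cue.
        return rgb + fused


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 64)     # e.g. 14x14 tokens with 64-dim features
    aux = torch.randn(2, 3, 196, 64)  # three auxiliary streams (e.g. thermal, event, LiDAR)
    out = TokenwiseAuxSelectFuse(dim=64, num_aux=3)(rgb, aux)
    print(out.shape)                  # torch.Size([2, 196, 64])
```

The hard one-hot gate mirrors the stated goal of activating a single reliable auxiliary stream per token; how the actual method preserves "supportive and contradictory evidence" is not specified in the abstract, so the residual connection here is only one plausible choice.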
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24333