Keywords: Semantic segmentation, multimodal fusion
Abstract: Multimodal semantic segmentation augments RGB imagery with an auxiliary sensing stream X (RGB+X), e.g., thermal, event, LiDAR, polarization, or light field, to improve robustness under adverse illumination and motion blur. We target two coupled bottlenecks in RGB+X segmentation: selecting the most predictive modality at each location and aligning semantics across modalities. The proposed framework performs token-wise auxiliary selection to activate a single, reliable auxiliary stream per token, and applies style-consistent, polarity-aware cross-modality fusion that transfers auxiliary appearance statistics to RGB features while preserving both supportive and contradictory evidence. We evaluate on five modality pairings (RGB+Thermal, RGB+Event, RGB+LiDAR, RGB+Polarization, and RGB+Light Field) and achieve new state-of-the-art results on each. Representative results include 76.89% mIoU on MFNet (RGB+Thermal) and 52.54% mIoU on MCubeS (RGB+A+D+N), among other modality combinations, surpassing recent fusion frameworks under comparable backbones and training protocols. Overall, this selective, alignment-aware fusion design provides a robust path to better RGB+X segmentation without sacrificing efficiency.
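The abstract describes two mechanisms: token-wise selection of a single auxiliary stream per token, and a style-transfer-like fusion that moves auxiliary appearance statistics onto RGB features. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of those two ideas, assuming a hard Gumbel-softmax gate for per-token selection and an AdaIN-style scale/shift for statistic transfer. All module and variable names (e.g., `TokenwiseAuxSelectFuse`, `selector`, `to_scale_shift`) are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code) of token-wise auxiliary
# selection plus AdaIN-style transfer of auxiliary statistics to RGB tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenwiseAuxSelectFuse(nn.Module):
    """Pick one auxiliary modality per token, then fuse it into RGB tokens."""

    def __init__(self, dim: int, num_aux: int):
        super().__init__()
        # Scores how predictive each auxiliary stream is for every token.
        self.selector = nn.Linear(dim * (num_aux + 1), num_aux)
        # Predicts per-token scale/shift from the selected auxiliary token
        # (a lightweight stand-in for "style-consistent" statistic transfer).
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # rgb: (B, N, C) RGB tokens; aux: (B, M, N, C) tokens from M auxiliary streams.
        B, M, N, C = aux.shape
        # Token-wise selection: a hard (one-hot) choice via straight-through
        # Gumbel-softmax so only a single auxiliary stream is active per token.
        sel_in = torch.cat([rgb, aux.permute(0, 2, 1, 3).reshape(B, N, M * C)], dim=-1)
        logits = self.selector(sel_in)                          # (B, N, M)
        weights = F.gumbel_softmax(logits, tau=1.0, hard=True)  # one-hot over M
        chosen = torch.einsum("bnm,bmnc->bnc", weights, aux)    # (B, N, C)
        # Statistic transfer: re-scale normalized RGB tokens with scale/shift
        # predicted from the chosen auxiliary tokens (AdaIN-like).
        scale, shift = self.to_scale_shift(chosen).chunk(2, dim=-1)
        fused = self.norm(rgb) * (1.0 + scale) + shift
        # Residual keeps the original RGB evidence alongside the transferred cue.
        return rgb + fused


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 64)     # e.g. 14x14 tokens with 64-dim features
    aux = torch.randn(2, 3, 196, 64)  # three auxiliary streams (e.g. thermal, event, LiDAR)
    out = TokenwiseAuxSelectFuse(dim=64, num_aux=3)(rgb, aux)
    print(out.shape)                  # torch.Size([2, 196, 64])
```

The hard one-hot gate mirrors the stated goal of activating a single reliable auxiliary stream per token; how the actual method preserves "supportive and contradictory evidence" is not specified in the abstract, so the residual connection here is only one plausible choice.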
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24333