SegRGB-X: General RGB-X Semantic Segmentation Model

20 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: general model, semantic segmentation, multimodal fusion, vision-language model
Abstract: Semantic segmentation across multiple sensor modalities requires leveraging both shared and modality-specific cues. Existing approaches often rely on modality-specific specialist models, which leads to redundancy and suboptimal performance. In this work, we propose SegRGB-X, a general model designed to jointly address semantic segmentation across five diverse multi-modal datasets. Our framework incorporates three key components: (1) Modality-Aware CLIP (MA-CLIP), fine-tuned with LoRA to extract modality-specific features; (2) a modality-aligned embedding mechanism that introduces modality-aligned prompts to mitigate the feature gap between input embeddings and control prompts; and (3) a Domain-Specific Refinement Module (DSRM) at the final stage of the backbone to adaptively refine modality-specific features. Extensive experiments on five datasets covering event, thermal, depth, polarization, and light-field modalities demonstrate the effectiveness of SegRGB-X: our model achieves an average mIoU of 65.03%, outperforming previous specialist models. The code will be made available.
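
The abstract outlines three components: a LoRA-adapted CLIP encoder (MA-CLIP), modality-aligned prompts, and a final-stage refinement module (DSRM). Since the paper's implementation is not public, the following is only a minimal, self-contained sketch of how such a pipeline could be wired together; the module names, dimensions, toy transformer backbone, and stacking of RGB with the X modality are all assumptions for illustration, not the authors' actual design.

```python
# Hypothetical sketch of a SegRGB-X-style pipeline (assumed structure, not the official code).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight plus a low-rank (LoRA) update."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)   # stands in for frozen CLIP weights
        self.lora_a = nn.Linear(dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))


class MACLIPEncoder(nn.Module):
    """Toy stand-in for Modality-Aware CLIP: shared attention, LoRA-adapted MLPs,
    and learnable modality-aligned prompts prepended to the token sequence."""
    def __init__(self, dim=256, depth=4, num_modalities=5, num_prompts=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_modalities, num_prompts, dim) * 0.02)
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, 8, batch_first=True),
                "lora_mlp": LoRALinear(dim),
                "norm1": nn.LayerNorm(dim),
                "norm2": nn.LayerNorm(dim),
            }) for _ in range(depth)
        ])

    def forward(self, tokens, modality_id):
        # Prepend modality-aligned prompts to bridge the gap between
        # input embeddings and control prompts.
        p = self.prompts[modality_id].unsqueeze(0).expand(tokens.size(0), -1, -1)
        x = torch.cat([p, tokens], dim=1)
        for blk in self.blocks:
            h = blk["norm1"](x)
            x = x + blk["attn"](h, h, h, need_weights=False)[0]
            x = x + blk["lora_mlp"](blk["norm2"](x))
        return x[:, p.size(1):]   # drop prompt tokens, keep patch tokens


class DSRM(nn.Module):
    """Domain-Specific Refinement Module: one lightweight expert per modality
    that adaptively refines the final backbone features."""
    def __init__(self, dim=256, num_modalities=5):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_modalities)]
        )

    def forward(self, feats, modality_id):
        return feats + self.experts[modality_id](feats)


class SegRGBX(nn.Module):
    def __init__(self, dim=256, num_classes=19, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3 + 3, dim, patch, stride=patch)  # RGB + X stacked (assumed)
        self.encoder = MACLIPEncoder(dim)
        self.dsrm = DSRM(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb, x_mod, modality_id):
        feats = self.patch_embed(torch.cat([rgb, x_mod], dim=1))   # B, C, H/16, W/16
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)                  # B, HW, C
        tokens = self.dsrm(self.encoder(tokens, modality_id), modality_id)
        logits = self.head(tokens).transpose(1, 2).reshape(B, -1, H, W)
        return nn.functional.interpolate(logits, scale_factor=16, mode="bilinear")


if __name__ == "__main__":
    model = SegRGBX()
    rgb = torch.randn(2, 3, 128, 128)
    thermal = torch.randn(2, 3, 128, 128)   # e.g. thermal as the X modality
    print(model(rgb, thermal, modality_id=1).shape)  # torch.Size([2, 19, 128, 128])
```

In this sketch, only the LoRA adapters, prompts, DSRM experts, and head would be trainable, mirroring the abstract's idea of a single shared backbone with small modality-specific additions rather than five separate specialist models.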
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24148