MOSAIC: Multimodal Object and Semantic Segmentation with Adapter Integration and Contextual Fusion

17 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | CC BY 4.0
Keywords: multimodal RGB-IR, object detection, semantic segmentation, computer vision, modality misalignment
TL;DR: This paper presents MOSAIC, a framework that improves multimodal RGB-IR object detection and semantic segmentation by combining Vision Transformers with novel alignment and fusion modules, achieving state-of-the-art results on benchmark datasets.
Abstract: We introduce MOSAIC, a novel framework for multimodal RGB-IR object detection and semantic segmentation. MOSAIC builds on Vision Transformers and introduces modules including Deformable Feature Sampling, a Feature Attention Fusion Block, and a Contextual Feature Enhancer. These components dynamically align and integrate RGB and IR features and capture multi-scale contextual information for both detection and segmentation. Extensive evaluations show that MOSAIC achieves state-of-the-art results on the FLIR, LLVIP, MFNet, and VT-series benchmark datasets, significantly improving robustness and accuracy in RGB-IR downstream tasks.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8342