MFS: A Saliency-Driven Interactive Multimodal Fusion Framework for Robust Semantic Segmentation in Complex and Occluded Scenes

Authors: ICLR 2026 Conference Submission 12933 Authors (anonymous)

Published: 18 Sept 2025 (modified: 08 Oct 2025)
License: CC BY 4.0
Keywords: Semantic segmentation, target detection, multimodal fusion, weak targets
Abstract: In complex scenes, semantic segmentation struggles to detect distant small or weak targets and to recognize occluded objects. Existing methods suffer from limited robustness and suboptimal multimodal feature fusion. To address these issues, this paper proposes an interactive multimodal semantic segmentation framework based on frequency-domain dynamic routing and activation-region guidance, which enhances the feature extraction capability, fusion robustness, and semantic representation of multimodal images. The framework consists of three core modules: first, an edge feature enhancement module that performs fine-grained selection of key regions in the initial features to strengthen weak targets and edge details; second, an activation-region-guided hybrid attention module that fuses salient-region information from the infrared and visible modalities; and finally, a deep semantic enhancement learning module that incorporates dynamic convolutional masks to improve the semantic consistency of the fused features at both global and local levels. Experiments on multiple public datasets demonstrate that the proposed method outperforms existing approaches in image fusion quality, segmentation accuracy, and object detection performance, and is especially robust and generalizable in complex and occluded scenes.
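The abstract gives no implementation details, but as a reading aid, below is a minimal PyTorch sketch of the three-stage pipeline it describes (edge enhancement, activation-region-guided cross-modal fusion, dynamic-mask semantic refinement). Every internal choice here is an illustrative assumption, not the authors' architecture: the high-pass edge proxy, the cross-modal sigmoid gating, the depthwise dynamic masks, the channel counts, and all class names (EdgeFeatureEnhancement, ActivationRegionGuidedAttention, DeepSemanticEnhancement, MFSSketch) are hypothetical.

# Minimal sketch of the three-stage fusion pipeline described in the abstract.
# All module internals, names, and shapes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeFeatureEnhancement(nn.Module):
    """Hypothetical edge/weak-target enhancement: a learned gate selects
    high-frequency (edge-like) regions of the initial features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # High-pass residual as a crude edge proxy (assumption).
        low = F.avg_pool2d(x, 3, stride=1, padding=1)
        edges = x - low
        return x + self.gate(edges) * edges


class ActivationRegionGuidedAttention(nn.Module):
    """Hypothetical cross-modal fusion: each modality's activation map
    gates the other modality's features before summation."""
    def __init__(self, channels):
        super().__init__()
        self.ir_act = nn.Conv2d(channels, 1, 1)
        self.vis_act = nn.Conv2d(channels, 1, 1)

    def forward(self, f_ir, f_vis):
        a_ir = torch.sigmoid(self.ir_act(f_ir))     # salient IR regions
        a_vis = torch.sigmoid(self.vis_act(f_vis))  # salient visible regions
        return f_ir * a_vis + f_vis * a_ir


class DeepSemanticEnhancement(nn.Module):
    """Hypothetical dynamic-mask refinement: per-pixel masks predicted from
    the fused features modulate them before a residual refinement conv."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.refine(x * self.mask(x)) + x


class MFSSketch(nn.Module):
    """Toy end-to-end pipeline: per-modality encoders -> edge enhancement ->
    guided fusion -> semantic refinement -> segmentation head."""
    def __init__(self, channels=32, num_classes=9):
        super().__init__()
        self.enc_ir = nn.Conv2d(1, channels, 3, padding=1)   # IR: 1 channel
        self.enc_vis = nn.Conv2d(3, channels, 3, padding=1)  # visible: RGB
        self.edge_ir = EdgeFeatureEnhancement(channels)
        self.edge_vis = EdgeFeatureEnhancement(channels)
        self.fuse = ActivationRegionGuidedAttention(channels)
        self.sem = DeepSemanticEnhancement(channels)
        self.head = nn.Conv2d(channels, num_classes, 1)

    def forward(self, ir, vis):
        f_ir = self.edge_ir(self.enc_ir(ir))
        f_vis = self.edge_vis(self.enc_vis(vis))
        fused = self.fuse(f_ir, f_vis)
        return self.head(self.sem(fused))


if __name__ == "__main__":
    model = MFSSketch()
    ir = torch.randn(2, 1, 64, 64)
    vis = torch.randn(2, 3, 64, 64)
    print(model(ir, vis).shape)  # torch.Size([2, 9, 64, 64])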
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12933