CMAD-UNet: UNet-Driven RGB-D Salient Object Detection with Cross-Modal Consistency and Aggregative Decoding

Qi Xu, Zhaozhao Su, Zhaoru Guo, Yongming Li, Liejun Wang, Panpan Zheng

Published: 2025, Last Modified: 23 Jan 2026. Venue: ICMR 2025. License: CC BY-SA 4.0

Abstract: Current RGB-D salient object detection (SOD) methods predominantly rely on simplistic cross-modal fusion strategies that inadequately model intrinsic inter-modal correlations and underutilize the hierarchical representations of U-shaped architectures. Because low-level spatial details are insufficiently exploited, predictions are often incomplete and boundaries ambiguous. To address these limitations, we propose Cross-Modal Consistency and Aggregative Decoding in an enhanced UNet (CMAD-UNet), a novel architecture that advances saliency detection through synergistic multi-modal correlation fusion and adaptive feature aggregation. The network integrates three core modules. First, the Enhanced Atrous Spatial Pyramid Pooling Module (EnASP), embedded in a dual-branch hybrid CNN-Transformer encoder, expands the receptive field to capture global contextual information. Second, the Correlation-Perceptive Fusion Module (CPFM) constructs learnable affinity matrices through contrastive consistency learning, suppressing cross-modal noise while aligning feature distributions. Third, the Selective Aggregation Decoding Module (SADM) dynamically weights multi-level features via channel-spatial attention during upsampling, suppressing noise propagation and generating saliency maps with well-defined boundaries. In extensive experiments on 8 benchmark datasets with 5 evaluation metrics, CMAD-UNet outperforms 14 state-of-the-art approaches.
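The abstract does not specify how CPFM or SADM are implemented, so the following is only a minimal NumPy sketch of the two general ideas it names: fusing RGB and depth features through a cross-modal affinity matrix, and reweighting features with channel and spatial attention. The function names `affinity_fuse` and `channel_spatial_gate`, the cosine-similarity affinity, the residual sum, and the parameter-free sigmoid gates are all assumptions made for illustration; the actual modules use learnable affinity matrices and contrastive consistency learning, neither of which is modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def affinity_fuse(rgb, depth):
    """Hypothetical affinity-based fusion of two (C, H, W) feature maps.

    Assumption: affinity is a row-normalised cosine similarity between
    every RGB position and every depth position (not the learned
    affinity described in the abstract).
    """
    c, h, w = rgb.shape
    r = rgb.reshape(c, h * w).T    # (N, C), one row per spatial position
    d = depth.reshape(c, h * w).T  # (N, C)
    r_n = r / (np.linalg.norm(r, axis=1, keepdims=True) + 1e-8)
    d_n = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-8)
    affinity = softmax(r_n @ d_n.T, axis=1)  # (N, N), each row sums to 1
    d_aligned = affinity @ d                 # depth re-expressed per RGB position
    fused = r + d_aligned                    # assumed residual-style sum
    return fused.T.reshape(c, h, w), affinity

def channel_spatial_gate(x):
    """Hypothetical parameter-free channel and spatial gating of (C, H, W)."""
    channel_w = sigmoid(x.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1)
    spatial_w = sigmoid(x.mean(axis=0, keepdims=True))       # (1, H, W)
    return x * channel_w * spatial_w

rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((4, 3, 5))
depth_feat = rng.standard_normal((4, 3, 5))
fused_feat, aff = affinity_fuse(rgb_feat, depth_feat)
gated_feat = channel_spatial_gate(fused_feat)
print(fused_feat.shape, aff.shape, gated_feat.shape)
```

In this sketch each gate lies strictly between 0 and 1, so gating can only attenuate a feature, which is one simple way to picture the noise suppression the abstract attributes to attention-weighted decoding.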

External IDs: dblp:conf/mir/XuSGLWZ25