Abstract: Multi-modal salient object detection (SOD), which integrates auxiliary data such as depth or thermal information, has become a significant task in computer vision in recent years. Traditionally, the challenges of identifying salient objects in RGB, RGB-D (depth), and RGB-T (thermal) images are tackled separately, which often leads to issues such as poorly defined object edges or overconfident, inaccurate predictions. Recent studies have shown that designing a unified end-to-end framework to handle all three types of SOD tasks simultaneously is both necessary and difficult. To address this need, we propose a novel approach that treats multi-modal SOD as a conditional mask generation task using diffusion models. Specifically, we introduce DiMSOD, which enables the concurrent use of local controls (depth and thermal maps) and global controls (RGB images) within a unified model for progressive denoising and refined prediction. DiMSOD is efficient, requiring only the fine-tuning of a local control adapter on an existing Stable Diffusion model, which not only reduces the fine-tuning cost and model size, making it more viable for real-world applications, but also enhances the integration of multi-modal conditional controls. Additionally, we develop several modules, including SOD-ControlNet, a Feature Adaptive Network (FAN), and a Feature Injection Attention Network (FIAN), to further improve the model's performance. Extensive experiments demonstrate that DiMSOD efficiently detects salient objects across RGB, RGB-D, and RGB-T datasets, achieving superior performance compared to previous methods. Our code and datasets are available at: https://anonymous.4open.science/r/DiMSOD-0B47.
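The core idea sketched in the abstract, a frozen pretrained diffusion backbone conditioned by a small trainable adapter that injects local controls (depth/thermal) during progressive denoising, can be illustrated with a deliberately simplified toy. This is not the DiMSOD implementation; all function names, the additive-residual conditioning, and the step sizes below are hypothetical stand-ins for the ControlNet-style design the paper describes.

```python
import random

def frozen_backbone(noisy_mask, rgb, t):
    # Stand-in for the pretrained (frozen) Stable Diffusion UNet:
    # predicts a noise estimate from the noisy mask and the global
    # control (RGB image). Its parameters are never updated.
    return [m - 0.1 * r for m, r in zip(noisy_mask, rgb)]

def adapter(noisy_mask, depth, weight):
    # Trainable local-control branch (hypothetical): only `weight`
    # would be fine-tuned, keeping training cost and model size small.
    return [weight * d for d in depth]

def denoise(mask, rgb, depth, steps=10, adapter_weight=0.05):
    # Progressive denoising: at each step the adapter's residual is
    # added to the frozen backbone's noise estimate, and the mask is
    # refined by subtracting the combined estimate.
    for t in reversed(range(steps)):
        eps = frozen_backbone(mask, rgb, t)
        eps = [e + a for e, a in zip(eps, adapter(mask, depth, adapter_weight))]
        mask = [m - 0.1 * e for m, e in zip(mask, eps)]
    return mask

random.seed(0)
rgb = [random.random() for _ in range(4)]      # global control (image)
depth = [random.random() for _ in range(4)]    # local control (depth map)
init = [random.gauss(0, 1) for _ in range(4)]  # initial noise
refined = denoise(init, rgb, depth)            # refined saliency mask
```

The design choice mirrored here is that only the adapter's parameters would receive gradients during fine-tuning, so the generative prior of the backbone is reused rather than retrained.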