Abstract: Cost-effective salient object detection (SOD) in multi-modal scenarios is a crucial yet challenging task. This work explores a unified multi-modal SOD approach that performs strongly across various domain-specific datasets. Existing methods, however, often struggle with poor representation ability across data distributions and inefficient adaptation to multi-modal data. To address these challenges, we propose UMMSOD, a unified multi-modal salient object detection framework that integrates a novel frequency-based prompt enhancement generator and a modified cross-modal adapter, achieving an effective balance between accuracy and efficiency. Concretely, the frequency-based prompt enhancement generator employs a spatial self-attention mechanism to extract salient features across different modalities, enhancing representation capability. The modified cross-modal adapter then exploits multi-scale features to facilitate modality knowledge integration while effectively bridging the gap between modalities. Extensive experiments on 15 major multi-modal SOD benchmarks demonstrate that UMMSOD achieves competitive performance while introducing only 1.89M cross-modality trainable parameters.
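The abstract names two components: a frequency-based prompt enhancement generator with spatial self-attention, and a lightweight cross-modal adapter. The sketch below is a minimal, hypothetical PyTorch illustration of how such components could be wired together; the class names, tensor shapes, gating scheme, and fusion rule are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sketch (not the authors' implementation): a frequency-domain
# prompt generator refined by spatial self-attention, plus a bottlenecked
# cross-modal adapter with few trainable parameters.
import torch
import torch.nn as nn


class FrequencyPromptGenerator(nn.Module):
    """Assumed design: gate features in the frequency domain, then refine
    the result with spatial self-attention over the pixel grid."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Learnable per-channel gate applied to the 2D spectrum (assumption).
        self.freq_gate = nn.Parameter(torch.ones(channels, 1, 1))
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Frequency-domain enhancement: gate the spectrum, transform back.
        spec = torch.fft.rfft2(x, norm="ortho")
        x_freq = torch.fft.irfft2(spec * self.freq_gate, s=(h, w), norm="ortho")
        # Spatial self-attention with pixels as tokens: (B, H*W, C).
        tokens = x_freq.flatten(2).transpose(1, 2)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # prompt features


class CrossModalAdapter(nn.Module):
    """Assumed adapter: down-project both modalities into a small bottleneck,
    mix them, and add the result back to the RGB stream as a residual."""

    def __init__(self, channels: int, bottleneck: int = 32):
        super().__init__()
        self.down_rgb = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.down_aux = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.mix = nn.Conv2d(2 * bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.down_rgb(rgb), self.down_aux(aux)], dim=1)
        # Residual update of the RGB stream with cross-modal information.
        return rgb + self.up(torch.relu(self.mix(fused)))


if __name__ == "__main__":
    x_rgb = torch.randn(2, 64, 32, 32)  # RGB backbone features (assumed shape)
    x_aux = torch.randn(2, 64, 32, 32)  # depth/thermal backbone features
    prompt = FrequencyPromptGenerator(64)(x_rgb)
    fused = CrossModalAdapter(64)(prompt, x_aux)
    print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

The bottleneck adapter is one common way to keep the cross-modality trainable parameter count small while the backbone stays frozen, which is consistent with the 1.89M figure reported in the abstract, though the actual architecture may differ.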
DOI: 10.1145/3731715.3733459