Abstract: Cost-effective salient object detection (SOD) in multi-modal scenarios is a crucial yet challenging task. This work explores a unified multi-modal SOD approach that performs strongly across various domain-specific datasets. Existing methods, however, often struggle with poor representation ability across data distributions and inefficient adaptation to multi-modal data. To address these challenges, we propose UMMSOD, a unified multi-modal salient object detection framework that integrates a novel frequency-based prompt enhancement generator and a modified cross-modal adapter, achieving an effective balance between accuracy and efficiency. Concretely, the frequency-based prompt enhancement generator employs a spatial self-attention mechanism to extract salient features across different modalities, enhancing representation capability. The modified cross-modal adapter then exploits multi-scale features to facilitate modality knowledge integration while effectively bridging the gap between modalities. Extensive experiments on 15 major multi-modal SOD benchmarks demonstrate that UMMSOD achieves competitive performance while introducing only 1.89M cross-modality trainable parameters.
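The abstract names two components: a frequency-based prompt enhancement generator with spatial self-attention, and a lightweight cross-modal adapter. The sketch below is a minimal, hypothetical PyTorch illustration of how such components could be wired together; the class names, tensor shapes, gating scheme, and fusion rule are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sketch (not the authors' implementation): a frequency-domain
# prompt generator refined by spatial self-attention, plus a bottlenecked
# cross-modal adapter with few trainable parameters.
import torch
import torch.nn as nn


class FrequencyPromptGenerator(nn.Module):
    """Assumed design: gate features in the frequency domain, then refine
    the result with spatial self-attention over the pixel grid."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Learnable per-channel gate applied to the 2D spectrum (assumption).
        self.freq_gate = nn.Parameter(torch.ones(channels, 1, 1))
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Frequency-domain enhancement: gate the spectrum, transform back.
        spec = torch.fft.rfft2(x, norm="ortho")
        x_freq = torch.fft.irfft2(spec * self.freq_gate, s=(h, w), norm="ortho")
        # Spatial self-attention with pixels as tokens: (B, H*W, C).
        tokens = x_freq.flatten(2).transpose(1, 2)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # prompt features


class CrossModalAdapter(nn.Module):
    """Assumed adapter: down-project both modalities into a small bottleneck,
    mix them, and add the result back to the RGB stream as a residual."""

    def __init__(self, channels: int, bottleneck: int = 32):
        super().__init__()
        self.down_rgb = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.down_aux = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.mix = nn.Conv2d(2 * bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.down_rgb(rgb), self.down_aux(aux)], dim=1)
        # Residual update of the RGB stream with cross-modal information.
        return rgb + self.up(torch.relu(self.mix(fused)))


if __name__ == "__main__":
    x_rgb = torch.randn(2, 64, 32, 32)  # RGB backbone features (assumed shape)
    x_aux = torch.randn(2, 64, 32, 32)  # depth/thermal backbone features
    prompt = FrequencyPromptGenerator(64)(x_rgb)
    fused = CrossModalAdapter(64)(prompt, x_aux)
    print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

The bottleneck adapter is one common way to keep the cross-modality trainable parameter count small while the backbone stays frozen, which is consistent with the 1.89M figure reported in the abstract, though the actual architecture may differ.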
DOI: 10.1145/3731715.3733459