Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Existing SOD methods use various Transformer-based models for feature extraction. However, limited by the scale of their training datasets and their training strategies, these Transformer-based models still fall short in segmentation performance and generalization. The Segment Anything Model (SAM) is trained on a large-scale segmentation dataset, which gives it strong generalization and segmentation capabilities. Nonetheless, SAM requires accurate prompts for target objects, which are unavailable in SOD. Additionally, SAM neither exploits multi-scale and multi-layer information nor incorporates fine-grained details. To apply SAM to SOD and address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM). Specifically, we introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with few trainable parameters. Moreover, we propose a Multi-Layer Fusion Block (MLFB) to comprehensively utilize the multi-layer information from SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to incorporate fine-grained details into SAM. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization to other segmentation tasks. The source code will be publicly available.
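Since the paper's code is not included on this page, the following is a minimal, hypothetical PyTorch sketch of the adapter idea the abstract describes: a small bottleneck module with parallel depthwise convolutions at several kernel sizes, inserted into a frozen ViT encoder so that only the adapter weights are trained. The class name, dimensions, and kernel sizes below are illustrative assumptions, not MDSAM's actual LMSA design.

```python
import torch
import torch.nn as nn

class LightweightMultiScaleAdapter(nn.Module):
    """Bottleneck adapter with parallel depthwise convolutions at several
    scales. Illustrative sketch only; not the paper's exact architecture."""

    def __init__(self, dim: int, bottleneck: int = 32, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)          # project tokens down
        self.branches = nn.ModuleList([
            # depthwise conv per scale; odd k with padding k//2 keeps H, W
            nn.Conv2d(bottleneck, bottleneck, k, padding=k // 2, groups=bottleneck)
            for k in kernel_sizes
        ])
        self.up = nn.Linear(bottleneck, dim)            # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token map, the layout used inside SAM's ViT blocks
        z = self.act(self.down(x))                      # (B, H, W, bottleneck)
        z = z.permute(0, 3, 1, 2)                       # (B, bottleneck, H, W)
        z = sum(branch(z) for branch in self.branches)  # fuse the scales
        z = z.permute(0, 2, 3, 1)                       # back to (B, H, W, bottleneck)
        return x + self.up(self.act(z))                 # residual adapter output
```

In a setup like this, the SAM image encoder would be frozen and one such adapter attached to each transformer block, so the number of trainable parameters stays small, matching the training-efficiency claim in the abstract.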
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Our main contributions are summarized as follows: 1) We propose a novel framework for adapting SAM to SOD, named Multi-scale and Detail-enhanced SAM (MDSAM). 2) We introduce a Lightweight Multi-Scale Adapter (LMSA) that learns task-specific, multi-scale information while remaining training-efficient. 3) We comprehensively utilize the multi-layer information from SAM's image encoder through our proposed Multi-Layer Fusion Block (MLFB), sketched below. 4) We propose a Detail Enhancement Module (DEM) that introduces fine-grained details into the segmentation results. 5) We perform extensive experiments on mainstream SOD datasets to verify the effectiveness of MDSAM, and further experiments demonstrate the strong generalization of our proposed model. All of these contributions are closely related to visual media and to the conference.
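To make the multi-layer fusion idea in contribution 3 concrete, here is a hypothetical PyTorch sketch of fusing feature maps tapped from several encoder depths: project each to a common width, concatenate, and mix with a 1x1 convolution. The class name, projection scheme, and layer count are assumptions for illustration, not the paper's MLFB implementation.

```python
from typing import List

import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Fuses features from several encoder layers into one map.
    Illustrative sketch only; not the paper's exact MLFB."""

    def __init__(self, in_dim: int, out_dim: int, num_layers: int = 4):
        super().__init__()
        # one 1x1 projection per tapped encoder layer
        self.proj = nn.ModuleList([
            nn.Conv2d(in_dim, out_dim, kernel_size=1) for _ in range(num_layers)
        ])
        # mix the concatenated projections back down to out_dim channels
        self.fuse = nn.Sequential(
            nn.Conv2d(out_dim * num_layers, out_dim, kernel_size=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, C, H, W) maps from different encoder depths
        projected = [p(f) for p, f in zip(self.proj, feats)]
        return self.fuse(torch.cat(projected, dim=1))
```

A decoder could then combine this fused map with the detail features produced by a module like the DEM before predicting the saliency mask.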
Supplementary Material: zip
Submission Number: 1376