Deep Instruction Tuning for Segment Anything Model

Published: 20 Jul 2024, Last Modified: 06 Aug 2024, MM 2024 Poster, CC BY 4.0
Abstract: Recently, the Segment Anything Model (SAM) has become a research hotspot in multimedia and computer vision, exhibiting powerful and versatile capabilities on various (un)conditional image segmentation tasks. Although SAM supports different types of segmentation prompts, we note that, compared to point- and box-guided segmentation, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in SAM's default lightweight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM: one end-to-end and one layer-wise. With minimal modifications, DIT directly turns the image encoder of SAM into a stand-alone vision-language learner, rather than building another deep fusion branch, thereby maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive RIS benchmark datasets show that the simple end-to-end DIT improves SAM by a large margin, while the layer-wise DIT further boosts performance to state-of-the-art with much less data and training cost. Our code is released at: https://github.com/wysnzzzz/DIT.
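
To make the end-to-end idea concrete, below is a minimal PyTorch-style sketch of what "deep instruction tuning in the image encoder" could look like: text-instruction embeddings are projected into the visual token space, concatenated with the image patch tokens, and processed jointly by the encoder blocks before only the image tokens are passed on to the mask decoder. All module names, dimensions, and the generic transformer blocks here are illustrative assumptions for exposition, not SAM's or the paper's actual implementation.

```python
import torch
import torch.nn as nn

class EndToEndDITEncoder(nn.Module):
    """Illustrative end-to-end deep instruction tuning over a ViT-like encoder."""

    def __init__(self, dim=768, depth=12, num_heads=12, text_dim=512):
        super().__init__()
        # Project instruction embeddings (e.g. from a text encoder) into the visual token space.
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim) patch embeddings from the image encoder stem
        # text_tokens:  (B, N_txt, text_dim) instruction embeddings
        txt = self.text_proj(text_tokens)
        x = torch.cat([image_tokens, txt], dim=1)  # joint image-text token sequence
        x = self.blocks(x)                         # deep fusion inside the encoder
        n_img = image_tokens.shape[1]
        # Only the fused image tokens would be handed to the lightweight mask decoder.
        return x[:, :n_img], x[:, n_img:]

# Hypothetical usage with dummy tensors.
enc = EndToEndDITEncoder()
fused_img, fused_txt = enc(torch.randn(2, 196, 768), torch.randn(2, 20, 512))
```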
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work introduces two Deep Instruction Tuning (DIT) methods for the Segment Anything Model (SAM), aiming to enhance its performance on text-instructed image segmentation tasks. Despite SAM's impressive capabilities across various image segmentation tasks, its performance on text-guided segmentation is comparatively weak, primarily because its default lightweight mask decoder handles text instructions through only shallow fusion. The proposed methods, end-to-end and layer-wise DIT, improve the interaction between text instructions and visual features without significantly altering SAM's architecture. They enable deeper integration of text instructions into the segmentation process by treating the image encoder of SAM as a stand-alone vision-language learner. Extensive experiments across three benchmark datasets demonstrate that the simple end-to-end DIT significantly improves SAM's performance, and that the layer-wise DIT further boosts its effectiveness to state-of-the-art results. This contribution showcases the potential of deep instruction tuning for enhancing multi-modal interaction within existing models, sets new benchmarks for text-guided image segmentation, and illustrates how fine-grained tuning approaches can substantially augment multi-modal models in multimedia processing.
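
The layer-wise variant mentioned above can be pictured as re-injecting the instruction tokens at every encoder block rather than only once at the input. The sketch below is a hedged, assumption-laden illustration of that idea (per-layer text projections, generic transformer blocks, illustrative dimensions); it is not the authors' released code.

```python
import torch
import torch.nn as nn

class LayerWiseDITEncoder(nn.Module):
    """Illustrative layer-wise deep instruction tuning: text tokens re-injected per block."""

    def __init__(self, dim=768, depth=12, num_heads=12, text_dim=512):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
            )
            for _ in range(depth)
        ])
        # One lightweight projection per layer, so each block receives freshly
        # projected instruction tokens.
        self.text_projs = nn.ModuleList([nn.Linear(text_dim, dim) for _ in range(depth)])

    def forward(self, image_tokens, text_tokens):
        x = image_tokens
        for blk, proj in zip(self.blocks, self.text_projs):
            txt = proj(text_tokens)
            n_img = x.shape[1]
            fused = blk(torch.cat([x, txt], dim=1))
            x = fused[:, :n_img]  # carry only the image tokens between layers
        return x

# Hypothetical usage with dummy tensors.
enc = LayerWiseDITEncoder()
out = enc(torch.randn(2, 196, 768), torch.randn(2, 20, 512))
```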
Supplementary Material: zip
Submission Number: 1075