Abstract: Referring Image Segmentation (RIS) is a task aimed at segmenting objects expressed in natural language within an image. This task requires an understanding of the relationship between vision and language, along with precise segmentation capabilities. In the field of computer vision, CLIP and the Segment Anything Model (SAM) have gained significant attention for their classification and segmentation capabilities, respectively. Given that both models possess skills essential for RIS, combining them seems to be an effective strategy. In this paper, we propose a model that integrates CLIP and SAM to enhance RIS. Since SAM lacks classification capabilities, we introduce a new module, trained on an additional instance segmentation task, that supplies the SAM mask decoder with features specifying the target object. The features derived from this module serve as inputs to the SAM decoder, enabling SAM to segment the regions corresponding to the given natural language expressions. We conducted experiments on the traditional RefCOCO/+/g benchmarks, as well as the recently introduced gRefCOCO and Refrom datasets, demonstrating the advantages of our approach. Code will be available at https://github.com/hitachi-rd-cv/dfam.
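The overall pipeline described above (CLIP-style text/image features, a fusion module producing target-specifying embeddings, and a SAM-style mask decoder consuming them) can be sketched in miniature. This is an illustrative toy, not the authors' implementation: all function names, dimensions, and the random-projection "encoders" are invented stand-ins for the real CLIP and SAM components.

```python
import numpy as np

D = 32          # feature dimension (illustrative)
H = W = 16      # spatial resolution of the decoder's feature map

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((3, D)) * 0.1  # fixed stand-in for an image encoder

def encode_text(expression: str) -> np.ndarray:
    """Stand-in for a CLIP text encoder: bag-of-characters hashed into D dims."""
    v = np.zeros(D)
    for ch in expression.lower():
        v[ord(ch) % D] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for an image encoder: per-pixel features via a fixed projection."""
    return image @ PROJ                               # (H, W, D)

def fusion_module(text_feat: np.ndarray, image_feats: np.ndarray) -> np.ndarray:
    """Sketch of the paper's target-specifying module: mix pooled image
    features with the text feature into a single prompt embedding."""
    pooled = image_feats.mean(axis=(0, 1))            # (D,)
    return 0.5 * text_feat + 0.5 * pooled             # (D,) prompt token

def mask_decoder(prompt: np.ndarray, image_feats: np.ndarray) -> np.ndarray:
    """SAM-style decoder stand-in: similarity between the prompt token and
    each pixel's feature, squashed to a soft segmentation mask."""
    logits = image_feats @ prompt                     # (H, W)
    return 1.0 / (1.0 + np.exp(-logits))

image = rng.random((H, W, 3))
image_feats = encode_image(image)
prompt = fusion_module(encode_text("the dog on the left"), image_feats)
mask = mask_decoder(prompt, image_feats)              # (H, W) values in (0, 1)
```

The key structural point the sketch captures is that the language signal never touches the decoder directly: it is first converted by the fusion module into a prompt-like embedding, the same interface through which SAM normally receives point or box prompts.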
External IDs: dblp:conf/wacv/Ito25