SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

Published: 02 Nov 2023, Last Modified: 18 Dec 2023, UniReps Poster
Supplementary Material: pdf
Keywords: Foundation Model, Multi-Task Learning, Zero-Shot Learning, Semantic Segmentation
TL;DR: We merge the vision foundation models SAM and CLIP into SAM-CLIP, a unified vision backbone that combines semantic and spatial understanding, is suited to edge devices, and achieves state-of-the-art results in zero-shot semantic segmentation.
Abstract: The landscape of publicly available vision foundation models (VFMs), such as CLIP and SAM, is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pretraining objectives. For instance, CLIP excels at semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe based on multi-task distillation to efficiently merge VFMs into a unified model that assimilates their expertise. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that combines the strengths of SAM and CLIP in a single backbone, making it well suited to edge-device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results. It outperforms previous models specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvements on the Pascal-VOC and COCO-Stuff datasets, respectively.
Track: Extended Abstract Track
Submission Number: 52
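
Note on the metric: the abstract reports its segmentation gains in mean IoU (mean intersection-over-union). As a reminder of what that number measures, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `mean_iou`, the flat label-list input, and the choice to skip classes that appear in neither prediction nor ground truth are all assumptions made for illustration. Benchmark protocols for Pascal-VOC and COCO-Stuff typically accumulate counts over the whole dataset and may handle ignore labels, which this sketch does not.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over flattened per-pixel class labels.

    Hypothetical helper for illustration only; classes absent from both
    `pred` and `target` are skipped (an assumption, not a fixed standard).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mean IoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A reported "+6.8% mean IoU" is a difference in this averaged quantity, expressed in percentage points.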