CLIP has greatly advanced zero-shot segmentation by leveraging its strong visual-language association and generalization capability. However, directly adapting CLIP for segmentation often yields suboptimal results due to inconsistencies between image-level and pixel-level prediction objectives. Additionally, merely combining segmentation and CLIP models often leads to disjoint optimization, introducing significant computational overhead and additional parameters. To address these issues, we propose a novel CLIP-to-Seg Distillation approach, incorporating global and local distillation to flexibly transfer CLIP’s powerful zero-shot generalization capability to existing closed-set segmentation models. Global distillation leverages CLIP’s CLS token to condense segmentation features and distills high-level concepts to the segmentation model via image-level prototypes. Local distillation adapts CLIP’s local semantic transferability to dense prediction tasks using object-level features, aided by pseudo-mask generation for latent unseen class mining. To further generalize the CLIP-distilled segmentation model, we generate latent embeddings for the mined latent classes by coordinating their semantic embeddings and dense features. Our method equips existing closed-set segmentation models with strong generalization capabilities for open concepts through effective and flexible CLIP-to-Seg distillation. Without relying on the CLIP model or adding extra computational overhead or parameters during inference, our method can be seamlessly integrated into existing segmentation models and achieves state-of-the-art performance on multiple zero-shot segmentation benchmarks.
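To make the two distillation terms more concrete, below is a minimal PyTorch-style sketch of how an image-level (global) and an object-level (local) alignment loss could look. The function names, tensor shapes, the simple average/masked pooling, and the cosine-distance objective are illustrative assumptions under which the segmentation features are already projected to CLIP's embedding dimension; this is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def global_distillation_loss(seg_features, clip_cls_token):
    """Sketch of global distillation (assumed form): condense dense
    segmentation features into an image-level prototype and align it
    with the CLS token from a frozen CLIP image encoder."""
    # seg_features: (B, C, H, W) decoder features, assumed projected to CLIP dim C
    # clip_cls_token: (B, C) CLS embedding from the frozen CLIP image encoder
    prototype = seg_features.flatten(2).mean(dim=2)            # (B, C) image-level prototype
    prototype = F.normalize(prototype, dim=-1)
    target = F.normalize(clip_cls_token, dim=-1)
    return (1.0 - (prototype * target).sum(dim=-1)).mean()     # cosine-distance alignment


def local_distillation_loss(seg_features, clip_patch_tokens, pseudo_masks):
    """Sketch of local distillation (assumed form): pool object-level
    features under pseudo masks (e.g. covering mined latent regions) and
    align each pooled feature with the mask-pooled CLIP patch tokens."""
    # seg_features:      (B, C, H, W) segmentation features
    # clip_patch_tokens: (B, C, H, W) CLIP dense features resized to the same grid
    # pseudo_masks:      (B, K, H, W) soft masks for K mined regions per image
    masks = pseudo_masks.flatten(2)                            # (B, K, HW)
    masks = masks / masks.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    seg_obj = torch.einsum('bkn,bcn->bkc', masks, seg_features.flatten(2))
    clip_obj = torch.einsum('bkn,bcn->bkc', masks, clip_patch_tokens.flatten(2))
    seg_obj = F.normalize(seg_obj, dim=-1)
    clip_obj = F.normalize(clip_obj, dim=-1)
    return (1.0 - (seg_obj * clip_obj).sum(dim=-1)).mean()
```

In such a setup, both losses would be added to the standard closed-set segmentation loss during training only, so the CLIP encoder and the extra terms can be dropped at inference time, which is consistent with the claim that no additional parameters or computation are required when the model is deployed.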