Spatial Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

10 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: CLIP, Attention, Training-Free, Open Vocabulary, Semantic Segmentation
TL;DR: We introduce TLH-CLIP, a novel training-free framework that systematically exploits the spatial discriminability across Token, Layer, and Head levels in CLIP to enable dense predictions for open-vocabulary semantic segmentation.
Abstract: Extending CLIP models to semantic segmentation remains a considerable challenge, largely due to the misalignment between their image-level pre-training objectives and the pixel-level spatial understanding required for dense predictions. Prior efforts have achieved encouraging results by reorganizing the final layer and feature representations of CLIP to enhance dense predictions. However, these approaches often inherit the global alignment bias of the final layer, leading to suboptimal spatial discriminability and segmentation performance. In this work, we propose TLH-CLIP, a novel training-free framework that systematically exploits the spatial discriminability across Token, Layer, and Head levels in CLIP for dense predictions. Through comprehensive analysis, we uncover three key findings: (i) anomalous tokens emerge in the final layers; they are category-agnostic yet disproportionately attract attention from semantically meaningful patch tokens, thereby degrading spatial discriminability; (ii) the final few layers primarily enhance global image-text alignment at a substantial cost to local discriminability (e.g., the last 3 layers in ViT-B/16 and the last 5 layers in ViT-L/14); (iii) a few attention heads (e.g., 10 out of 144 in ViT-B/16) demonstrate strong spatial discriminability across different datasets. Motivated by these insights, we propose three complementary techniques: abnormal token replacement, semantic-spatial reweighting, and selective head enhancement, which effectively recover spatial coherence and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that TLH-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
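To make the three ideas named in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of how anomalous-token replacement and head selection could be realized on generic ViT tensors. All shapes, the z-score threshold, the "read from an earlier layer" comment, and the `keep_heads` indices are illustrative assumptions, not the paper's actual settings or implementation.

```python
# Hypothetical sketch (plain PyTorch) of: (1) replacing anomalous patch tokens,
# (2) reading features from an earlier layer rather than the globally-aligned
# final layers, and (3) keeping only a subset of spatially discriminative heads.
import torch

def replace_anomalous_tokens(patch_tokens: torch.Tensor, z_thresh: float = 3.0) -> torch.Tensor:
    """patch_tokens: (B, N, D). Flag tokens whose L2 norm is a z-score outlier
    and replace them with the mean of the remaining (normal) tokens."""
    norms = patch_tokens.norm(dim=-1)                                   # (B, N)
    z = (norms - norms.mean(dim=1, keepdim=True)) / norms.std(dim=1, keepdim=True)
    abnormal = z > z_thresh                                             # boolean mask
    out = patch_tokens.clone()
    for b in range(patch_tokens.size(0)):
        if abnormal[b].any() and (~abnormal[b]).any():
            out[b, abnormal[b]] = patch_tokens[b, ~abnormal[b]].mean(dim=0)
    return out

def head_selected_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            keep_heads: list) -> torch.Tensor:
    """q, k, v: (B, H, N, d). Average self-attention over only the heads judged
    spatially discriminative (keep_heads is an illustrative head subset)."""
    q, k, v = q[:, keep_heads], k[:, keep_heads], v[:, keep_heads]
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return (attn @ v).mean(dim=1)                                       # (B, N, d)

# Toy usage on random tensors (B=1, H=12 heads, N=196 patches, d=64 per head).
B, H, N, d = 1, 12, 196, 64
tokens = torch.randn(B, N, H * d)   # stand-in for features from an intermediate layer
tokens = replace_anomalous_tokens(tokens)
q = k = v = tokens.view(B, N, H, d).transpose(1, 2)
dense_feats = head_selected_attention(q, k, v, keep_heads=[0, 3, 7])   # hypothetical heads
print(dense_feats.shape)            # torch.Size([1, 196, 64])
```

In a real pipeline, the resulting dense features would be compared against CLIP text embeddings per patch to produce segmentation logits; this sketch only illustrates the token- and head-level manipulations.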
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 18173