Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

TMLR Paper 6751 Authors

01 Dec 2025 (modified: 08 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and its features, they often inherit the global alignment bias of the preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP at the \emph{layer}, \emph{head}, and \emph{token} levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image–text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 in ViT-L/14), partly due to the emergence of abnormal tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) displays consistently strong visual discriminability across datasets; (iii) abnormal tokens exhibit sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement. Together, they restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Comprehensive experiments on eight widely used semantic segmentation benchmarks demonstrate that LHT-CLIP achieves substantial performance improvements across diverse scenarios, underscoring its effectiveness and practicality for real-world deployment.
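The abstract notes that abnormal tokens show sparse, peaky activation patterns relative to normal tokens, which suggests they can be detected and replaced from statistics of the patch features alone. The sketch below is a hypothetical illustration of that idea, not the paper's released code: the function name `replace_abnormal_tokens`, the top-5% mass sparsity proxy, the 0.6 threshold, and the mean-of-normal-tokens replacement are all assumptions made for the example.

```python
# Hypothetical sketch (not LHT-CLIP's actual implementation): flag ViT patch tokens
# whose feature mass is concentrated in a few channels (a sparsity proxy for
# "abnormal" tokens) and replace them with the mean of the remaining tokens.
import torch


def replace_abnormal_tokens(tokens: torch.Tensor, sparsity_thresh: float = 0.6) -> torch.Tensor:
    """tokens: (N, D) patch tokens from one image (CLS token excluded). Threshold is an assumption."""
    d = tokens.shape[1]
    k = max(1, int(0.05 * d))                       # top 5% of channels
    abs_feats = tokens.abs()
    topk_mass = abs_feats.topk(k, dim=1).values.sum(dim=1)
    total_mass = abs_feats.sum(dim=1).clamp_min(1e-8)
    sparsity = topk_mass / total_mass               # close to 1 => very peaky / sparse token

    abnormal = sparsity > sparsity_thresh           # boolean mask of suspected abnormal tokens
    if abnormal.any() and (~abnormal).any():
        # Crude stand-in for a replacement rule: overwrite abnormal tokens
        # with the mean of the normal tokens.
        tokens = tokens.clone()
        tokens[abnormal] = tokens[~abnormal].mean(dim=0)
    return tokens


if __name__ == "__main__":
    feats = torch.randn(196, 512)                   # e.g. 14x14 patches from a ViT-B/16
    feats[7] *= 0.0
    feats[7, :3] = 50.0                             # inject a peaky "abnormal" token
    cleaned = replace_abnormal_tokens(feats)
    print(cleaned[7].abs().max())                   # the injected token has been replaced
```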
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jose_Dolz1
Submission Number: 6751