AlignCLIP: Self-Guided Alignment for Remote Sensing Open-Vocabulary Semantic Segmentation

ICLR 2026 Conference Submission 6740 Authors

16 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models
Abstract: Open-Vocabulary Semantic Segmentation (OVSS) for remote sensing imagery plays a crucial role in applications such as land cover mapping and environmental monitoring. Recently, Contrastive Language-Image Pre-training (CLIP) has advanced the *training-free* paradigm of OVSS while also inspiring its exploration in the remote sensing domain. However, directly applying CLIP to remote sensing leads to cross-modal mismatches. Prevalent methods focus on exploring attention mechanism of CLIP visual encoder or introducing vision foundation models to obtain more discriminative feature, but they often overlook the alignment between patches and textual representations. To address this issue, we propose a *training-free* framework named **AlignCLIP**. We find that, objects of the same category tend to exhibit a more compact distribution in remote sensing, this enables a single visual feature to effectively represent all objects within the category. Based on this observation, we design the *Self-Guided Alignment (SGA)* module, which leverages the most reliable image-specific visual prototypes to refine the text embeddings. To mitigate interference among irrelevant features, we further introduce the *Cluster-Constrained Enhancement (CCE)* module, which clusters semantically similar patch features, suppresses inter-cluster correlations, and updates the logits map via a constraint propagation mechanism. Experiments on eight remote sensing benchmarks demonstrate that AlignCLIP consistently outperforms state-of-the-art *training-free* OVSS methods, achieving an average gain of +2.2 mIoU and offering a robust adaptive solution for open-vocabulary semantic segmentation in remote sensing. All code will be released.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6740