Aligning Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Published: 11 Jan 2026 · Last Modified: 24 Feb 2026 · IEEE TCSVT 2026 · CC BY-NC 4.0
Abstract: Pre-trained Vision-Language Models (VLMs) are often used to tackle the challenging task of Open-Vocabulary Segmentation (OVS). To preserve the valuable pre-trained knowledge of VLM-based mask classifiers, most existing approaches freeze their parameters during training. However, our comprehensive analysis identifies a previously overlooked limitation: OVS performance is primarily constrained by mask classification. Specifically, VLMs pre-trained on globally pooled image-text representations often fail to capture the localized, region-specific semantics necessary for accurate segmentation. This finding motivates us to improve the fine-grained alignment between word-level text features and pixel-level image features extracted by VLMs. To this end, we propose Fine-grained Semantic Reconstruction (FiSeR), a novel auxiliary task designed to enrich the spatial semantic detail of visual features. FiSeR trains the model to predict a randomly masked target class label from the image features and the remaining unmasked text. This encourages the model to link specific words to their corresponding image regions, improving its ability to recognize and segment objects at the region level. FiSeR is broadly applicable and can be incorporated into various VLM-based segmentation models to improve their performance. Additionally, we introduce the Text-guided Visual Aligner (TeVA), a lightweight network module that injects relevant fine-grained text semantics early in the visual encoding process. This allows the model to condition its visual processing on the target text categories from the outset, improving its ability to associate text with the correct spatial regions. Together, these components form our proposed framework, FOV-Seg. Notably, FOV-Seg achieves new state-of-the-art results across multiple representative OVS benchmarks, consistently improving performance while reducing training cost by nearly 5× compared to the previous best methods. Our code and data will be released.
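
To make the auxiliary objective concrete, below is a minimal sketch of what a FiSeR-style masked class-label reconstruction loss could look like, based only on the description in the abstract. The `ReconstructHead` module, the zero-masking scheme, the cosine-similarity scoring, and the temperature value are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


class ReconstructHead(torch.nn.Module):
    """Hypothetical head: cross-attends the unmasked text to pixel features."""

    def __init__(self, d, n_heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.proj = torch.nn.Linear(d, d)

    def forward(self, pixel_feats, masked_text):
        # Query the image with the remaining text tokens, then pool to one vector.
        out, _ = self.attn(masked_text, pixel_feats, pixel_feats)  # (B, C, D)
        return self.proj(out.mean(dim=1))                          # (B, D)


def fiser_aux_loss(pixel_feats, text_feats, target_idx, head, tau=0.07):
    """
    pixel_feats: (B, HW, D) pixel-level features from the VLM visual encoder.
    text_feats:  (B, C, D)  word-level embeddings of the C candidate class names.
    target_idx:  (B,)       index of the class present in each image.
    """
    B = text_feats.shape[0]
    # Mask out the embedding of the target class in the text stream.
    masked_text = text_feats.clone()
    masked_text[torch.arange(B), target_idx] = 0.0
    # Reconstruct the masked class from image features + remaining text.
    pred = head(pixel_feats, masked_text)                          # (B, D)
    # Score against all class embeddings; recovering the masked class forces
    # the visual features to carry region-specific class semantics.
    logits = torch.einsum('bd,bcd->bc',
                          F.normalize(pred, dim=-1),
                          F.normalize(text_feats, dim=-1))
    return F.cross_entropy(logits / tau, target_idx)
```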
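The abstract's description of TeVA suggests an early, text-conditioned modulation of the visual tokens. The block below is a speculative sketch under that reading; the cross-attention design, the zero-initialized gate, and placement after early ViT blocks are assumptions, not the published architecture. A zero-initialized gate is a common way to add a new pathway without disturbing the frozen pre-trained features at the start of training.

```python
import torch


class TextGuidedVisualAligner(torch.nn.Module):
    """Hypothetical TeVA-style block injecting text semantics into early visual tokens."""

    def __init__(self, d, n_heads=8):
        super().__init__()
        self.cross_attn = torch.nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(d)
        self.gate = torch.nn.Parameter(torch.zeros(1))  # starts as identity

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N, D) early-layer ViT tokens
        # text_tokens:   (B, C, D) embeddings of the target class names
        attended, _ = self.cross_attn(self.norm(visual_tokens),
                                      text_tokens, text_tokens)
        # Gated residual: at init tanh(0) = 0, so the pre-trained VLM's
        # visual features pass through unchanged.
        return visual_tokens + torch.tanh(self.gate) * attended
```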