LSMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

03 Sept 2025 (modified: 15 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Open-vocabulary semantic segmentation, large language models, visual attributes
TL;DR: We introduce LSMSeg, a pioneering framework that leverages large language models (LLMs) to create detailed, attribute-enriched text prompts, significantly improving text-visual alignment for OVSS.
Abstract: Open-vocabulary semantic segmentation requires precise pixel-level alignment of visual and textual representations, leveraging text as a universal reference to address visual disparities across diverse datasets. While prior efforts have primarily focused on enhancing visual representations or alignment models, the contribution of textual representations remains underexplored. Moreover, although CLIP excels at capturing image-level features, its limited capacity for fine-grained pixel-level representation poses a major challenge for semantic segmentation. To address these challenges, we propose LSMSeg, which employs large language models (LLMs) to generate enriched text prompts incorporating diverse visual attributes such as color, shape, size, and texture, thereby replacing simplistic templates with semantically rich descriptions. In addition, we propose a Feature Refinement Module that adapts visual features from the Segment Anything Model (SAM) to the CLIP space via a lightweight adapter, then fuses them with CLIP features through a learnable weighting strategy to strengthen pixel-to-text alignment. To further reduce computational overhead, we introduce a Category Filtering Module that accelerates training and decreases parameter complexity. Extensive experiments demonstrate that LSMSeg significantly enhances cross-modal alignment and achieves strong performance while maintaining efficiency, offering a robust advancement for open-vocabulary semantic segmentation.
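The abstract's Feature Refinement Module (lightweight adapter plus learnable fusion weight) could be sketched as follows. This is a minimal illustration, not the paper's implementation: the single-layer adapter, the scalar sigmoid-gated fusion weight, and the class/argument names (`FeatureRefinement`, `sam_dim`, `clip_dim`) are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class FeatureRefinement(nn.Module):
    """Hypothetical sketch of the abstract's Feature Refinement Module:
    a lightweight adapter projects SAM features into the CLIP embedding
    space, and a learnable weight fuses them with the CLIP features."""

    def __init__(self, sam_dim: int, clip_dim: int):
        super().__init__()
        # Lightweight adapter: a single linear projection (assumed form).
        self.adapter = nn.Linear(sam_dim, clip_dim)
        # Learnable fusion weight; sigmoid keeps it in (0, 1).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, sam_feats: torch.Tensor,
                clip_feats: torch.Tensor) -> torch.Tensor:
        projected = self.adapter(sam_feats)          # SAM -> CLIP space
        w = torch.sigmoid(self.alpha)                # fusion weight
        return w * projected + (1.0 - w) * clip_feats  # weighted fusion
```

Fusing via a single learned convex weight keeps the module cheap while letting training decide how much SAM detail to inject into the CLIP representation used for pixel-to-text matching.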
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1189