Keywords: Language-Guided Segmentation, MLLMs, SAM
TL;DR: In this paper, we propose Seg-Agent, a completely training-free language-guided segmentation method.
Abstract: Language-guided segmentation moves beyond the fixed category sets of traditional semantic segmentation, enabling models to segment any target region in an image based on user instructions. Existing methods are typically two-stage frameworks: they first employ multimodal large language models (MLLMs) to interpret the textual instruction and generate visual prompts from the image, and then use foundation segmentation models such as SAM to produce high-quality masks. However, because of the base models' limited spatial grounding ability, these frameworks usually require training on large-scale datasets to achieve improved segmentation accuracy. In this paper, we propose Seg-Agent, a completely training-free language-guided segmentation method. By constructing an explicit reasoning chain of generation, selection, and refinement, Seg-Agent achieves performance comparable to training-based approaches. Additionally, to evaluate the generalization ability of Seg-Agent, we collect a diverse dataset covering various language-guided segmentation scenarios, named Various-LangSeg. Extensive experiments demonstrate the effectiveness of our proposed method. The code and dataset will be made publicly available.
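The generation-selection-refinement chain described in the abstract can be sketched as a simple pipeline. All function names, candidate formats, and the scoring rule below are illustrative assumptions for exposition only; the paper's actual prompting and model interfaces are not specified here.

```python
# Hypothetical sketch of a generation -> selection -> refinement chain
# for language-guided segmentation. Names and logic are assumptions,
# not the authors' implementation.

def generate_prompts(instruction, image):
    # Stage 1 (generation): an MLLM would propose candidate visual
    # prompts (e.g. bounding boxes) for the instruction. Stubbed here
    # as fixed candidate boxes (x0, y0, x1, y1).
    return [(10, 10, 50, 50), (12, 8, 40, 40), (100, 100, 160, 150)]

def select_prompt(candidates):
    # Stage 2 (selection): score candidates and keep the best one.
    # Box area stands in for an MLLM's quality judgment here.
    def area(box):
        x0, y0, x1, y1 = box
        return (x1 - x0) * (y1 - y0)
    return max(candidates, key=area)

def refine_mask(prompt, image):
    # Stage 3 (refinement): a promptable segmentation model such as
    # SAM would turn the prompt into a high-quality mask; stubbed as
    # returning the box itself.
    return {"mask_box": prompt, "refined": True}

def seg_agent(instruction, image=None):
    # Training-free pipeline: each stage calls frozen models; no
    # parameters are updated anywhere in the chain.
    candidates = generate_prompts(instruction, image)
    best = select_prompt(candidates)
    return refine_mask(best, image)

result = seg_agent("the red cup on the table")
print(result)  # largest candidate box is selected and "refined"
```

The key property the sketch illustrates is that every stage queries an off-the-shelf model, so the chain requires no fine-tuning.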
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6893