CGEBench: Benchmarking Concept Generalization of Promptable Image Segmentation Models

Alexander von Recum; Christoph Schnabl

Published: 24 Apr 2026, Last Modified: 01 Jun 2026. Venue: VisCon 2026 Poster. License: CC BY 4.0

Keywords: image segmentation, promptable image segmentation, concept generalization

TL;DR: We introduce CGEBench, a benchmark for evaluating concept generalization in promptable image segmentation models, and show that SAM-3 exhibits significant inconsistencies when segmenting objects under increasingly general concept prompts.

Abstract: Promptable image segmentation models have emerged as an evolution of image segmentation models with fixed classes. This approach allows for great flexibility during usage, enabling the user to prompt the model using a broad set of text prompts. However, it is unknown how robust these models are to generalizations of concepts, or "hypernyms". For example, if a model is prompted with "orange cat", "cat", and "animal", will the mask for the more specific concept be contained within masks for more general concepts? Has the model learned a consistent concept hierarchy, where each concept entails a broader concept? To evaluate this, we introduce CGEBench, a modified version of SaCo-Gold for evaluating concept generalization of open-vocabulary image segmentation models. We evaluate SAM-3, a state-of-the-art image segmentation model, on CGEBench and show that it exhibits inconsistencies when generalizing to more abstract concepts, with concepts of increasing generality being labeled increasingly less consistently on average. We also find that the distribution of intersection-over-ground-truth values is almost entirely bimodal, with a concept most often being either recognized completely correctly or not at all. These results position concept generalization of future promptable image segmentation models as an important area for benchmarking and improvement.
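As a rough illustration of the two quantities the abstract refers to, the sketch below computes an intersection-over-ground-truth score and a containment check on toy masks. The definitions used here (|prediction ∩ ground truth| / |ground truth|, and the fraction of a specific-concept mask covered by a general-concept mask) are assumptions inferred from the wording of the abstract, not the benchmark's confirmed formulas. The function names and the toy masks are hypothetical, and no SAM-3 or SaCo-Gold API is used.

```python
from typing import Dict, Set, Tuple

Pixel = Tuple[int, int]   # (row, column) of a foreground pixel
Mask = Set[Pixel]         # a binary mask stored as its set of foreground pixels


def intersection_over_ground_truth(pred: Mask, gt: Mask) -> float:
    """Assumed definition: share of ground-truth pixels covered by the prediction."""
    if not gt:
        raise ValueError("ground-truth mask must not be empty")
    return len(pred & gt) / len(gt)


def containment(specific: Mask, general: Mask) -> float:
    """Assumed definition: share of the specific-concept mask that lies inside
    the general-concept mask. An empty specific mask is trivially contained."""
    if not specific:
        return 1.0
    return len(specific & general) / len(specific)


# Toy example: a 2x2 ground-truth object and hypothetical predictions for
# three prompts of increasing generality.
gt: Mask = {(r, c) for r in range(2) for c in range(2)}
predictions: Dict[str, Mask] = {
    "orange cat": set(gt),  # object fully recovered
    "cat": set(gt),         # object fully recovered
    "animal": set(),        # object missed entirely
}

scores = {
    prompt: intersection_over_ground_truth(mask, gt)
    for prompt, mask in predictions.items()
}
hierarchy_ok = containment(predictions["orange cat"], predictions["cat"])
hierarchy_broken = containment(predictions["cat"], predictions["animal"])
```

In this toy setup every score is either 1.0 or 0.0, mirroring the all-or-nothing pattern the abstract describes, and the containment value drops to 0.0 once the most general prompt fails to return the object.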

Submission Number: 26
