Keywords: 2D Segmentation, Vision Language Model, Robotic Perception
Abstract: Promptable segmentation has lowered the barrier to extracting pixel-accurate regions, yet current models are essentially part-level engines: they respond well to local cues but remain agnostic to object identity, frequently fragmenting a single instance into multiple masks. We present a training-free pipeline that lifts part-level predictions to object-level masks by coupling open-vocabulary semantics from a vision–language model (VLM) with a SAM2 grounding-and-masking backend. The VLM inventories the scene and returns a normalized list of object names and aliases. These labels, without any boxes, points, or hand-crafted prompts, are passed to the grounding–segmentation stack, which produces instance-consistent masks for each named object. A lightweight orchestration layer handles name canonicalization, synonym expansion, and conflict resolution (e.g., "table" versus "table leg") and consolidates fragments while preserving the boundary quality of the underlying segmenter. On a variety of everyday scenes, we show that our pipeline handles both real-world images and simulation renderings, yielding object-aligned masks that are directly usable for object-centric editing and downstream reasoning, as well as practical robot perception tasks such as grasp planning and object-centric mapping. Beyond practicality, our findings argue for a clean separation of concerns (semantics from VLMs, spatial precision from promptable segmenters) as a robust, open-vocabulary front end for object-level scene understanding and a lightweight component in robotics pipelines.
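To make the orchestration layer described in the abstract concrete, the sketch below illustrates, under stated assumptions, how label canonicalization, synonym expansion, conflict resolution, and fragment consolidation could be wired together. The helper names `query_vlm_for_objects` and `grounded_segment` are hypothetical stand-ins for the VLM inventory step and the grounding+SAM2 backend (the paper's actual interfaces are not specified in this abstract), and the prefix-based conflict rule is only an illustrative heuristic, not the authors' method.

```python
# Minimal sketch of the orchestration layer.
# query_vlm_for_objects / grounded_segment are hypothetical stubs standing in
# for the VLM and the grounding + SAM2 masking backend described in the abstract.
from dataclasses import dataclass


@dataclass
class Mask:
    label: str
    pixels: set  # stand-in for a binary mask; a real pipeline would use pixel arrays


def query_vlm_for_objects(image) -> list[str]:
    """Stand-in for the VLM scene inventory (object names and aliases)."""
    return ["Table", "table leg", "coffee mug", "mug"]


def grounded_segment(image, label: str) -> list[Mask]:
    """Stand-in for the grounding + SAM2 backend: one call per object label."""
    return [Mask(label=label, pixels=set())]


# Assumed alias table used for synonym expansion / canonicalization.
SYNONYMS = {"mug": "coffee mug"}


def canonicalize(labels: list[str]) -> set[str]:
    """Lowercase, deduplicate, and collapse aliases onto canonical names."""
    canon = {SYNONYMS.get(l.lower(), l.lower()) for l in labels}
    # Illustrative conflict resolution: drop part-level names subsumed by an
    # object-level name that is also present (e.g. "table leg" vs "table").
    return {l for l in canon
            if not any(o != l and l.startswith(o + " ") for o in canon)}


def segment_objects(image) -> list[Mask]:
    labels = canonicalize(query_vlm_for_objects(image))
    masks = []
    for label in sorted(labels):
        fragments = grounded_segment(image, label)
        # Consolidate fragments for the same named object into one instance mask.
        merged_pixels = set().union(*(f.pixels for f in fragments))
        masks.append(Mask(label=label, pixels=merged_pixels))
    return masks


if __name__ == "__main__":
    for m in segment_objects(image=None):
        print(m.label)  # prints "coffee mug" and "table"; "table leg" and "mug" are merged away
```

Running the stubbed example keeps only the object-level labels, which mirrors the abstract's claim that part-level fragments are lifted to instance-consistent, object-aligned masks.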
Submission Number: 4