TIDES: Training-free Instance Detection from Semantics

12 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Computer Vision, Training-Free Open-Vocabulary Instance Segmentation
TL;DR: We propose TIDES, a novel pipeline that integrates the semantic understanding of dual encoders with the instance awareness of promptable segmentation models, enabling accurate instance segmentation without training.
Abstract: Efforts to leverage the coarse semantic understanding of vision-language dual encoder models, such as CLIP, for dense prediction tasks without training have shown promise, particularly in training-free open-vocabulary semantic segmentation (TF-OVSS). However, training-free open-vocabulary instance segmentation (TF-OVIS) remains largely unexplored because dual encoder models cannot distinguish individual instances on their own. We systematically evaluate the suitability of promptable segmentation models (PSMs), such as SAM, as sources of accurate instance delineation and present TIDES (Training-free Instance Detector from Semantics), a pipeline that repurposes any pairing of a TF-OVSS method and a PSM for instance segmentation. At its core is our instance-oriented (IO) scoring, which leverages patch-level semantic alignments from TF-OVSS to re-evaluate PSM-generated masks, accurately identifying individual object instances without training, instance-level labels, or external detectors. Extensive evaluation on the MS COCO-based OVIS benchmark across multiple TF-OVSS and PSM combinations demonstrates the flexibility and effectiveness of TIDES: it surpasses the previous best TF-OVIS method by 9.2 AP and naive baselines that use the original scoring by 2.7 AP.
Primary Area: transfer learning, meta learning, and lifelong learning
Supplementary Material: pdf
Submission Number: 4547
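
Illustrative sketch (not from the submission): the abstract says that IO scoring re-evaluates PSM-generated masks using patch-level semantic alignments from TF-OVSS, but it does not give the formula. The snippet below shows only the general idea of re-scoring class-agnostic masks with a per-class alignment map, under the assumption that a mask's score is the mean alignment inside it. The function name `rescore_masks`, the array shapes, and the mean-pooling rule are all hypothetical and should not be read as the authors' IO scoring.

```python
import numpy as np

def rescore_masks(class_scores, masks):
    """Assign a class label and a semantic score to each class-agnostic mask.

    ASSUMPTION: mean in-mask alignment is a stand-in scoring rule chosen for
    illustration; it is not the IO scoring defined in the paper.

    class_scores: (C, H, W) floats, per-class alignment map (e.g. from a TF-OVSS method).
    masks:        (N, H, W) booleans, candidate masks (e.g. from a PSM).
    Returns (labels, scores), each of shape (N,).
    """
    class_scores = np.asarray(class_scores, dtype=float)
    masks = np.asarray(masks, dtype=bool)

    # Pixel count of each mask; clamp to 1 so empty masks do not divide by zero.
    areas = np.maximum(masks.reshape(len(masks), -1).sum(axis=1), 1)

    # Sum of every class map inside every mask, giving an (N, C) matrix.
    inside = np.einsum("nhw,chw->nc", masks.astype(float), class_scores)
    mean_inside = inside / areas[:, None]

    labels = mean_inside.argmax(axis=1)   # best-aligned class per mask
    scores = mean_inside.max(axis=1)      # its mean alignment, used as the new score
    return labels, scores
```

In such a setup, the returned scores would replace the segmentation model's own mask confidences before ranking or filtering; how TIDES actually combines the two signals is described in the paper, not here.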