Segment Anything in Context with Vision Foundation Models

Published: 2025 · Last Modified: 07 Jan 2026 · Int. J. Comput. Vis. 2025 · CC BY-SA 4.0
Abstract: With the advent of large-scale pre-training, vision foundation models have emerged as powerful tools for open-world image understanding, showcasing remarkable capabilities across a range of visual tasks. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models often require task-specific architectural modifications and extensive fine-tuning to achieve satisfactory performance in specific domains. This limitation not only increases the complexity of deployment but also restricts their broader applicability in dynamic, real-world scenarios. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various segmentation tasks. Matcher can segment anything using in-context examples, without any training. Additionally, we design three effective components within the Matcher framework that collaborate with these foundation models and unleash their full potential across diverse tasks. Specifically, we develop a bidirectional matching strategy to ensure precise visual matching. We then design a robust prompt sampler that generates mask proposals with diverse semantic granularity. To further enhance accuracy, we propose an instance-level matching strategy that effectively filters out false-positive mask fragments. In addition, we deploy another vision foundation model to retrieve in-context examples, providing better prompts for Matcher. Our comprehensive experiments demonstrate that Matcher, without any additional training, achieves impressive generalization performance across a wide range of visual tasks, underscoring its substantial potential in advancing toward general perception.
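The bidirectional matching idea mentioned in the abstract can be illustrated with a short sketch: correspondences between the in-context (reference) object patches and the target image patches are kept only when they are mutual nearest neighbours, and the surviving target locations would then serve as point prompts for a promptable segmenter such as SAM. This is a minimal sketch under stated assumptions, not the authors' implementation; the function and variable names (`bidirectional_match`, `ref_feats`, `tgt_feats`) are hypothetical, and random tensors stand in for real encoder features.

```python
# Minimal sketch of bidirectional (mutual nearest-neighbour) matching, as a
# hypothetical illustration of the matching step described in the abstract.
# Stand-in random tensors replace real image-encoder patch features.
import torch
import torch.nn.functional as F


def bidirectional_match(ref_feats, tgt_feats, ref_mask):
    """Keep only mutual nearest-neighbour correspondences between reference
    patches inside the in-context mask and all target-image patches."""
    # Normalise so the dot product acts as a cosine similarity.
    ref = F.normalize(ref_feats[ref_mask], dim=-1)   # (Nr, C) masked reference patches
    tgt = F.normalize(tgt_feats, dim=-1)             # (Nt, C) all target patches
    sim = ref @ tgt.T                                # (Nr, Nt) similarity matrix

    fwd = sim.argmax(dim=1)                          # reference -> target assignment
    bwd = sim.argmax(dim=0)                          # target -> reference assignment
    mutual = bwd[fwd] == torch.arange(ref.shape[0])  # keep cycle-consistent pairs only
    return fwd[mutual]                               # indices of matched target patches


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in features: 256 patches with 64-dim descriptors per image.
    ref_feats = torch.randn(256, 64)
    tgt_feats = torch.randn(256, 64)
    ref_mask = torch.zeros(256, dtype=torch.bool)
    ref_mask[:32] = True                             # pretend the first 32 patches cover the object

    matched = bidirectional_match(ref_feats, tgt_feats, ref_mask)
    # In a full pipeline, the matched patch indices would be converted to point
    # prompts and passed to a promptable segmenter to produce mask proposals.
    print(f"{matched.numel()} mutual matches out of {ref_mask.sum().item()} reference patches")
```

The cycle-consistency check (`bwd[fwd] == arange`) is what makes the matching bidirectional: a target patch is accepted only if it also points back to the same reference patch, which suppresses spurious one-way matches before any mask proposals are generated.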