DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut
Abstract: Foundation models have emerged as powerful tools across various domains, including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that using these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders, which could then serve as foundation vision encoders for downstream tasks. Project page: https://diffcut-segmentation.github.io
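To make the pipeline concrete, the sketch below illustrates a recursive Normalized Cut over an affinity graph built from per-patch encoder features. This is a minimal illustration under stated assumptions, not the paper's implementation: the cosine-similarity affinity, the median split of the Fiedler vector, and the `tau` / `min_size` stopping parameters are hypothetical choices made for clarity.

```python
import numpy as np
from scipy.linalg import eigh


def build_affinity(features):
    """Cosine-similarity affinity between patch features: (N, D) -> (N, N)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return np.clip(f @ f.T, 0.0, None)  # keep weights non-negative


def ncut_value(W, mask):
    """Normalized cut cost of the bipartition given by boolean `mask`."""
    cut = W[mask][:, ~mask].sum()
    return cut / W[mask].sum() + cut / W[~mask].sum()


def recursive_ncut(W, ids, labels, next_label, tau=0.9, min_size=8):
    """Recursively bipartition nodes `ids` while the cut cost stays below `tau`."""
    if len(ids) < 2 * min_size:
        return next_label
    sub = W[np.ix_(ids, ids)]
    D = np.diag(sub.sum(axis=1))
    # Second-smallest generalized eigenvector of (D - W) y = lambda * D y.
    _, vecs = eigh(D - sub, D)
    fiedler = vecs[:, 1]
    mask = fiedler > np.median(fiedler)
    if not mask.any() or mask.all():
        return next_label                    # degenerate split, stop
    if ncut_value(sub, mask) > tau:
        return next_label                    # split too costly, keep segment intact
    labels[ids[mask]] = next_label           # assign a new segment id to one side
    next_label += 1
    next_label = recursive_ncut(W, ids[mask], labels, next_label, tau, min_size)
    next_label = recursive_ncut(W, ids[~mask], labels, next_label, tau, min_size)
    return next_label


# Usage: `feats` is an (N, D) array of per-patch features from the encoder.
# feats = ...
# W = build_affinity(feats)
# labels = np.zeros(len(feats), dtype=int)
# recursive_ncut(W, np.arange(len(feats)), labels, next_label=1)
```

In this sketch, lowering `tau` stops the recursion earlier and yields coarser segments, which is one simple way to regulate the granularity of the detected objects.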