Keywords: Vision Transformers, Segmentation, Perceptual Grouping
TL;DR: We introduce a segmentation-centric backbone architecture built around a spatial grouping layer, which produces high-quality segmentation masks directly as part of feature extraction.
Abstract: Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields a hierarchical segmentation that arises *natively* in the feature extraction process, giving rise to our Native Segmentation Vision Transformer.
We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of *native*, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks.
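To make the idea concrete, below is a minimal sketch of what a content-aware spatial grouping layer could look like: input tokens are softly assigned to a smaller set of learned group tokens, so downsampling follows content rather than a uniform grid, and a hard argmax over the assignment yields a segmentation-like partition. This is an illustrative assumption, not the authors' implementation; all names (`GroupingLayer`, `num_groups`, etc.) are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of a content-aware spatial
# grouping layer: N input tokens are softly assigned to M learned group tokens.
import torch
import torch.nn as nn


class GroupingLayer(nn.Module):
    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        # Learned group queries act as the reduced token set (hypothetical design).
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, dim) -> grouped: (B, num_groups, dim)
        b = tokens.shape[0]
        q = self.to_q(self.group_tokens).unsqueeze(0).expand(b, -1, -1)  # (B, M, D)
        k = self.to_k(self.norm(tokens))                                 # (B, N, D)
        v = self.to_v(self.norm(tokens))
        attn = torch.einsum("bmd,bnd->bmn", q, k) / k.shape[-1] ** 0.5
        # Soft assignment of every input token to a group (softmax over groups);
        # an argmax over dim=1 gives a hard, segmentation-like token partition.
        assign = attn.softmax(dim=1)                                     # (B, M, N)
        grouped = torch.einsum("bmn,bnd->bmd", assign, v)                # (B, M, D)
        return grouped, assign


# Usage: 56x56 patch tokens reduced to 64 content-aware groups.
x = torch.randn(2, 56 * 56, 192)
layer = GroupingLayer(dim=192, num_groups=64)
out, assign = layer(x)
print(out.shape, assign.argmax(dim=1).shape)  # (2, 64, 192), (2, 3136)
```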
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 5525