Keywords: Vision Transformers, Segmentation, Perceptual Grouping
TL;DR: We introduce a segmentation-centric backbone architecture built around a spatial grouping layer, which produces high-quality segmentation masks directly as part of feature extraction.
Abstract: Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields a hierarchical segmentation that arises *natively* in the feature extraction process, giving rise to our Native Segmentation Vision Transformer.
We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of *native*, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks.
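To make the idea concrete, below is a minimal sketch of what a content-aware spatial grouping layer could look like: input tokens are softly assigned to a smaller set of learned group tokens, so downsampling follows content rather than a uniform grid, and a hard argmax over the assignment yields a segmentation-like partition. This is an illustrative assumption, not the authors' implementation; all names (`GroupingLayer`, `num_groups`, etc.) are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of a content-aware spatial
# grouping layer: N input tokens are softly assigned to M learned group tokens.
import torch
import torch.nn as nn


class GroupingLayer(nn.Module):
    def __init__(self, dim: int, num_groups: int):
        super().__init__()
        # Learned group queries act as the reduced token set (hypothetical design).
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, dim) -> grouped: (B, num_groups, dim)
        b = tokens.shape[0]
        q = self.to_q(self.group_tokens).unsqueeze(0).expand(b, -1, -1)  # (B, M, D)
        k = self.to_k(self.norm(tokens))                                 # (B, N, D)
        v = self.to_v(self.norm(tokens))
        attn = torch.einsum("bmd,bnd->bmn", q, k) / k.shape[-1] ** 0.5
        # Soft assignment of every input token to a group (softmax over groups);
        # an argmax over dim=1 gives a hard, segmentation-like token partition.
        assign = attn.softmax(dim=1)                                     # (B, M, N)
        grouped = torch.einsum("bmn,bnd->bmd", assign, v)                # (B, M, D)
        return grouped, assign


# Usage: 56x56 patch tokens reduced to 64 content-aware groups.
x = torch.randn(2, 56 * 56, 192)
layer = GroupingLayer(dim=192, num_groups=64)
out, assign = layer(x)
print(out.shape, assign.argmax(dim=1).shape)  # (2, 64, 192), (2, 3136)
```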
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 5525