DUPS: Dynamic upsampling for efficient semantic segmentation

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Computer Vision, Transformers, Vision Transformers, Semantic Segmentation, Efficient
TL;DR: We propose a computer vision encoder for semantic segmentation that dynamically upsamples regions containing semantic boundaries to higher resolution, improving performance.
Abstract: We present \textbf{DUPS}, a coarse-to-fine vision transformer for semantic segmentation. Unlike models that begin with dense high-resolution tokens, DUPS starts at low resolution and dynamically upsamples only regions predicted to contain semantic boundaries, following a “one-token-one-class” principle. Mixed-resolution attention enables interaction between coarse and fine tokens, allocating computation to semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments on ADE20K, COCO-Stuff, and Cityscapes demonstrate that DUPS achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive accuracy on Cityscapes at markedly lower compute. For example, DUPS-Base attains \textbf{54.6 mIoU} on ADE20K in the $\sim$110M-parameter class while using fewer FLOPs than comparable backbones.
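The coarse-to-fine mechanism described in the abstract can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' implementation: it assumes a per-token boundary score is available, selects the top-k most boundary-like coarse tokens, and naively expands each into four child tokens (a real model would use a learned upsampler and then apply mixed-resolution attention over the combined set). The function name `dynamic_upsample` and all parameters are illustrative.

```python
import numpy as np

def dynamic_upsample(tokens, boundary_scores, keep_ratio=0.25):
    """Hypothetical sketch of DUPS-style dynamic upsampling.

    tokens:          (N, D) coarse token embeddings
    boundary_scores: (N,) predicted likelihood that a token spans a
                     semantic boundary (higher = more complex region)
    keep_ratio:      fraction of coarse tokens to refine

    Returns a mixed-resolution token set: the unrefined coarse tokens
    plus 4 child tokens for every refined coarse token (each coarse
    token is assumed to cover a 2x2 patch of finer tokens).
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    refine_idx = np.argsort(boundary_scores)[-k:]      # top-k "complex" tokens
    keep_mask = np.ones(n, dtype=bool)
    keep_mask[refine_idx] = False

    coarse_kept = tokens[keep_mask]                    # homogeneous regions stay coarse
    # Naive refinement: copy each selected token into its 4 children.
    # (A learned upsampling module would go here in a real model.)
    fine = np.repeat(tokens[refine_idx], 4, axis=0)    # (4k, D)

    return np.concatenate([coarse_kept, fine], axis=0)

tokens = np.random.randn(16, 8)
scores = np.random.rand(16)
mixed = dynamic_upsample(tokens, scores, keep_ratio=0.25)
print(mixed.shape)  # (28, 8): 12 kept coarse + 4*4 fine tokens
```

With N=16 and keep_ratio=0.25, only 4 tokens are refined, so attention later operates over 28 tokens instead of the 64 a uniformly fine grid would require, which is the source of the FLOP savings the abstract claims for homogeneous regions.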
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9301