Learning Spectral-decomposited Tokens for Domain Generalized Semantic Segmentation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: The rapid development of Vision Foundation Models (VFMs) brings superior out-of-domain generalization to a variety of downstream tasks. Among them, domain generalized semantic segmentation (DGSS) poses unique challenges, as cross-domain images share common pixel-wise content information (i.e., semantics) but vary greatly in style (e.g., urban landscape, environment dependencies). How to effectively fine-tune VFMs for DGSS has recently become an open research topic for the vision community. In this paper, we present a novel Spectral-decomposited Tokens (SET) learning framework to push the frontier. Going beyond the existing paradigm of fine-tuning tokens on a frozen backbone, the proposed SET focuses on how to learn style-invariant features from these learnable tokens. Specifically, the frozen VFM features are first decomposed into phase and amplitude components in the frequency space, where the phase component mainly reflects content and the amplitude component mainly reflects style. Then, learnable tokens are adapted to learn the content and the style, respectively. As the cross-domain differences mainly rest in the style carried by the amplitude component, such information is decoupled from the tokens. Consequently, the refined feature maps represent the pixel-wise content more stably despite the style variation. Extensive cross-domain experiments under a variety of backbones and VFMs show state-of-the-art performance. We will make the source code publicly available.
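The phase/amplitude split mentioned in the abstract can be illustrated with a minimal sketch, assuming NumPy and a 2-D FFT over the spatial axes of a feature map. The function names, the random toy features, and the amplitude swap at the end are hypothetical illustrations only; this is not the authors' SET implementation and it does not include any learnable tokens.

```python
import numpy as np


def spectral_decompose(feat):
    """Split a real (C, H, W) feature map into amplitude and phase (hypothetical helper)."""
    spec = np.fft.fft2(feat, axes=(-2, -1))  # 2-D FFT over the spatial axes
    return np.abs(spec), np.angle(spec)      # amplitude ~ style, phase ~ content


def spectral_recompose(amplitude, phase):
    """Rebuild a spatial feature map from amplitude and phase (hypothetical helper)."""
    spec = amplitude * np.exp(1j * phase)
    # The input was real, so the imaginary part is numerical noise and is dropped.
    return np.fft.ifft2(spec, axes=(-2, -1)).real


# Toy stand-in for frozen backbone features; real features would come from a VFM.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))

amp, pha = spectral_decompose(feat)
recon = spectral_recompose(amp, pha)  # round trip recovers the input

# Illustration only: keep the phase (content) of `feat` while borrowing the
# amplitude (style) of a differently scaled and shifted feature map.
other = 3.0 * rng.standard_normal((4, 8, 8)) + 1.0
amp_other, _ = spectral_decompose(other)
restyled = spectral_recompose(amp_other, pha)
```

Because the decomposition is lossless, any change applied only to the amplitude leaves the phase, and hence the content cue described in the abstract, untouched.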
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: The rapid development of Vision Foundation Models (VFMs) brings superior out-of-domain generalization to a variety of downstream tasks. How to effectively fine-tune VFMs for domain generalized semantic segmentation (DGSS) has recently become an open research topic for the vision community. In this paper, we present a novel amplitude-decoupled token (ADT) learning framework to push the frontier.
Supplementary Material: zip
Submission Number: 870