Learning Spectral-decomposited Tokens for Domain Generalized Semantic Segmentation

Jingjun Yi; Qi Bi; Hao Zheng; Haolan Zhan; Wei Ji; Yawen Huang; Yuexiang Li; Yefeng Zheng

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0

Abstract: The rapid development of Vision Foundation Models (VFMs) brings superior out-of-domain generalization to a variety of downstream tasks. Among them, domain generalized semantic segmentation (DGSS) poses unique challenges, as cross-domain images share common pixel-wise content information (i.e., semantics) but vary greatly in style (e.g., urban landscape, environment dependencies). How to effectively fine-tune a VFM for DGSS has recently become an open research topic for the vision community. In this paper, we present a novel Spectral-decomposited Tokens (SET) learning framework to push the frontier. Going beyond the existing paradigm of fine-tuning tokens on a frozen backbone, the proposed SET focuses on how to learn style-invariant features from these learnable tokens. Specifically, the frozen VFM features are first decomposed into phase and amplitude components in the frequency space, where the phase component mainly reflects the content and the amplitude component mainly reflects the style. Then, learnable tokens are adapted to learn the content and style, respectively. As the cross-domain differences mainly rest in the style carried by the amplitude component, such information is decoupled from the tokens. Consequently, the refined feature maps represent the pixel-wise content more stably despite the style variation. Extensive cross-domain experiments under a variety of backbones and VFMs show state-of-the-art performance. We will make the source code publicly available.
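
The phase / amplitude split described in the abstract can be illustrated with a toy example. The sketch below is not the SET implementation: it uses a naive 1-D DFT from the Python standard library on a hand-made signal, and the helper names (`dft`, `idft`, `decompose`, `recompose`) as well as the choice of "style" as a global gain and offset are assumptions made purely for illustration. A real pipeline would more likely apply a 2-D FFT to frozen backbone feature maps.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a 1-D real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT, returning a complex sequence."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def decompose(x):
    """Split a signal into its amplitude and phase spectra."""
    X = dft(x)
    return [abs(c) for c in X], [cmath.phase(c) for c in X]

def recompose(amplitude, phase):
    """Rebuild a real signal from amplitude and phase spectra."""
    X = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [c.real for c in idft(X)]

# Hypothetical 1-D "feature" row standing in for one row of a feature map.
feat_a = [0.0, 1.0, 3.0, 1.0, 0.0, -1.0, -3.0, -1.0]
# Same structure under a toy "style" change (assumed: global gain and offset).
feat_b = [2.0 * v + 5.0 for v in feat_a]

amp_a, pha_a = decompose(feat_a)
amp_b, pha_b = decompose(feat_b)

# Amplitude + phase together reconstruct the original signal.
restored = recompose(amp_a, pha_a)

# Keeping the phase of feat_a but taking the amplitude of feat_b
# reproduces feat_b: in this toy case the gain/offset lives in the amplitude.
mixed = recompose(amp_b, pha_a)
```

In this toy setting the gain and offset change only the amplitude spectrum, so swapping amplitudes while keeping the phase transfers the "style" and leaves the structure intact. Real cross-domain style shifts are far less clean, so this should be read only as intuition for why the amplitude component is the part being decoupled.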

Primary Subject Area: [Content] Vision and Language

Secondary Subject Area: [Experience] Multimedia Applications

Relevance To Conference: The rapid development of Vision Foundation Models (VFMs) brings superior out-of-domain generalization to a variety of downstream tasks. How to effectively fine-tune a VFM for domain generalized semantic segmentation (DGSS) has recently become an open research topic for the vision community. In this paper, we present a novel amplitude-decoupled token (ADT) learning framework to push the frontier.

Supplementary Material: zip

Submission Number: 870
