Learning Frequency-Adapted Vision Foundation Model for Domain Generalized Semantic Segmentation

Published: 25 Sept 2024, Last Modified: 15 Jan 2025, NeurIPS 2024 poster, CC BY 4.0
Keywords: Semantic Segmentation, Domain Generalization, Vision Foundation Model, Haar Wavelets
TL;DR: Learning the style-invariant property of a vision foundation model in the frequency space via the Haar wavelet transform.
Abstract: The emerging vision foundation model (VFM) inherits the ability to generalize to unseen images. Nevertheless, the key challenge of domain-generalized semantic segmentation (DGSS) lies in the domain gap attributed to cross-domain styles, i.e., the variance of urban landscapes and environmental dependencies. Hence, maintaining the style-invariant property under varying domain styles becomes the key bottleneck in harnessing VFMs for DGSS. The frequency space after a Haar wavelet transformation provides a feasible way to decouple style information from domain-invariant content, since content and style information are retained in the low- and high-frequency components of the space, respectively. To this end, we propose a novel Frequency-Adapted (FADA) learning scheme to advance the frontier. Its overall idea is to tackle the content and style information separately through frequency tokens throughout the learning process. In particular, the proposed FADA consists of two branches, i.e., a low-frequency branch and a high-frequency branch. The former stabilizes the scene content, while the latter learns the scene styles and eliminates their impact on DGSS. Experiments conducted on various DGSS settings show the state-of-the-art performance of our FADA and its versatility across a variety of VFMs. Source code is available at \url{https://github.com/BiQiWHU/FADA}.
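To make the frequency-space decoupling concrete, below is a minimal sketch (not the authors' implementation; see their repository for that) of a single-level 2D Haar wavelet transform that splits a feature map into one low-frequency sub-band and three high-frequency sub-bands. Per the abstract, the low-frequency component retains scene content while the high-frequency components carry style. The function name `haar_dwt2d` and the tensor shapes are illustrative assumptions.

```python
# A minimal sketch of a single-level 2D Haar wavelet transform, assuming
# VFM patch features have been reshaped into a (B, C, H, W) grid with even
# H and W. This is an illustration of the decoupling idea, not FADA itself.
import torch

def haar_dwt2d(x: torch.Tensor):
    """Split x into low-frequency (LL) and high-frequency (LH, HL, HH) sub-bands."""
    # Take the four corners of each non-overlapping 2x2 block.
    a = x[:, :, 0::2, 0::2]  # top-left
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    # Orthonormal Haar combinations (each 1D step carries a 1/sqrt(2) factor,
    # so the 2D transform divides by 2 overall).
    ll = (a + b + c + d) / 2  # low frequency: local average, i.e., content
    lh = (a + b - c - d) / 2  # high frequency: vertical detail
    hl = (a - b + c - d) / 2  # high frequency: horizontal detail
    hh = (a - b - c + d) / 2  # high frequency: diagonal detail
    return ll, (lh, hl, hh)

feat = torch.randn(1, 256, 32, 32)  # hypothetical feature grid
low, high = haar_dwt2d(feat)
print(low.shape)  # torch.Size([1, 256, 16, 16])
```

In a two-branch scheme like the one the abstract describes, `low` would feed the content-stabilizing branch and the three `high` sub-bands the style-handling branch; the transform is invertible, so no information is lost by the split.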
Primary Area: Machine vision
Submission Number: 829