Better to Teach than to Give: Domain Generalized Semantic Segmentation via Agent Queries with Diffusion Model Guidance
Abstract: Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain to generalize to unseen target domains that share a consistent contextual distribution but differ in visual appearance.
Most existing methods rely on domain randomization or data generation but struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information.
Inspired by the diffusion model's capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task.
In this paper, we propose a novel agent \textbf{Query}-driven learning framework based on \textbf{Diff}usion model guidance for DGSS, named QueryDiff.
Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene;
(2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features;
(3) refining segmentation features using optimized agent queries for robust mask predictions.
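A minimal sketch (not the authors' released code) of how these three ingredients could fit together: agent queries attend to segmentation features, are guided by diffusion features, and then decode masks. All module names, dimensions, and the choice of cross-attention are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AgentQuerySketch(nn.Module):
    def __init__(self, num_queries=100, dim=256):
        super().__init__()
        # Learnable agent queries that aggregate instance-level semantics.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # (1) Queries attend to segmentation features to gather scene semantics.
        self.seg_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # (2) Queries attend to diffusion features, which serve as guidance
        #     encoding the scene's semantic distribution.
        self.diff_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # (3) Refined queries are projected back onto segmentation features
        #     to produce per-query masks (Mask2Former-style dot-product decoding).
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, seg_feats, diff_feats):
        # seg_feats, diff_feats: (B, H*W, dim) flattened feature maps.
        b = seg_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)              # (B, Q, dim)
        q, _ = self.seg_attn(q, seg_feats, seg_feats)                # ingredient (1)
        q, _ = self.diff_attn(q, diff_feats, diff_feats)             # ingredient (2)
        masks = torch.einsum("bqd,bnd->bqn", self.mask_proj(q), seg_feats)  # ingredient (3)
        return masks                                                 # (B, Q, H*W)


# Usage with random tensors standing in for backbone outputs.
if __name__ == "__main__":
    model = AgentQuerySketch()
    seg = torch.randn(2, 64 * 64, 256)
    diff = torch.randn(2, 64 * 64, 256)
    print(model(seg, diff).shape)  # torch.Size([2, 100, 4096])
```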
Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods.
Notably, it enhances the model's ability to generalize effectively to extreme domains, such as cubist art styles. Code is available at https://github.com/FanLiHub/QueryDiff.
Lay Summary: Deep learning models often struggle to accurately recognize and segment objects in images that differ significantly from their training data, such as those captured under varying weather conditions or with unusual artistic styles.
To address this challenge, we introduce QueryDiff, a novel method that leverages the distributional knowledge of diffusion models to guide generalizable visual understanding. QueryDiff constructs “agent queries” that extract and aggregate semantic information about objects within a scene. These queries learn the scene’s underlying semantic distribution through guidance from diffusion features, allowing the model to build a robust understanding of the scene’s structure. The refined queries are then used to enhance the segmentation predictions.
Experiments demonstrate that QueryDiff achieves significant performance gains across diverse scenarios and generalizes well to visually extreme domains, such as cubist-style art. This advancement not only improves the robustness of deep learning models but also enhances their practicality in real-world tasks that involve diverse and unpredictable visual conditions.
Primary Area: Applications->Computer Vision
Keywords: semantic segmentation, domain generalization, diffusion model
Submission Number: 4280