Keywords: Domain Generalization, Schrödinger Bridge, Image-Text Alignment, Optimal Transport
TL;DR: This paper proposes a Cross-Modal Schrödinger Bridge that aligns domain-specific images with domain-invariant text to enhance generalization to unseen domains.
Abstract: Domain generalization aims to train models that perform robustly on unseen target domains without access to target data.
The rise of vision-language foundation models has opened a new avenue, owing to their inherent out-of-distribution generalization capability.
However, static alignment to class-level textual anchors remains insufficient to handle the large distribution discrepancies among diverse domain-specific visual features.
In this work, we address this challenge with a novel cross-domain Schrödinger Bridge (SB) method, namely SBGen, which explicitly models the stochastic semantic evolution of features to achieve better generalization to unseen domains.
Technically, the proposed \texttt{SBGen} consists of three key components: (1) \emph{text-guided domain-aware feature selection} to isolate semantically aligned image tokens; (2) \emph{stochastic cross-domain evolution} to simulate the SB dynamics via a learnable time-conditioned drift; and (3) \emph{stochastic domain-agnostic interpolation} to construct semantically grounded feature trajectories.
Empirically, \texttt{SBGen} achieves state-of-the-art performance on domain generalization in both classification and segmentation. This work highlights the importance of modeling domain shifts as structured stochastic processes grounded in semantic alignment.
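As a rough illustration of the three components listed above, the following NumPy sketch shows one way each step could look in isolation. It is not the paper's implementation: the function names (`select_tokens`, `simulate_drift`, `bridge_interpolate`), the cosine-similarity top-k selection rule, the Euler-Maruyama integrator, the Brownian-bridge interpolant, and the fixed noise scale `sigma` are all assumptions made for illustration, and the learnable time-conditioned drift is stood in for by an arbitrary callable `drift_fn`.

```python
import numpy as np

def select_tokens(image_tokens, text_anchor, k):
    """Keep the k image tokens most cosine-similar to a text anchor.

    Assumed selection rule (hypothetical): top-k by cosine similarity.
    image_tokens: (n, d) array, text_anchor: (d,) array.
    """
    tok = image_tokens / np.linalg.norm(image_tokens, axis=1, keepdims=True)
    txt = text_anchor / np.linalg.norm(text_anchor)
    scores = tok @ txt
    top = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return image_tokens[top], scores[top]

def simulate_drift(x0, drift_fn, sigma, n_steps, rng):
    """Euler-Maruyama integration of dx = drift_fn(x, t) dt + sigma dW on [0, 1].

    drift_fn is a placeholder for a learned time-conditioned drift network.
    """
    dt = 1.0 / n_steps
    x = x0.copy()
    for i in range(n_steps):
        t = i * dt
        noise = rng.standard_normal(x.shape)
        x = x + drift_fn(x, t) * dt + sigma * np.sqrt(dt) * noise
    return x

def bridge_interpolate(x0, x1, t, sigma, rng):
    """Sample a Brownian-bridge point between x0 (t=0) and x1 (t=1).

    Assumed interpolant (hypothetical): mean (1-t) x0 + t x1,
    standard deviation sigma * sqrt(t (1-t)).
    """
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((16, 8))  # toy image tokens
    text = rng.standard_normal(8)          # toy class-level text anchor
    selected, _ = select_tokens(tokens, text, k=4)

    # Toy drift that pulls each selected token toward the text anchor.
    drift = lambda x, t: (text - x) / (1.0 - t)
    evolved = simulate_drift(selected, drift, sigma=0.1, n_steps=50, rng=rng)
    midpoint = bridge_interpolate(selected, evolved, t=0.5, sigma=0.1, rng=rng)
    print(selected.shape, evolved.shape, midpoint.shape)
```

In this toy setup the drift is the closed-form Brownian-bridge drift toward a fixed target; in a learned setting it would presumably be replaced by a network conditioned on the time `t`.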
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 7080