% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024/uai2024} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024/uai2024} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz}
\usetikzlibrary{arrows.meta, backgrounds, decorations, fit, positioning, calc}
\tikzset{fit margins/.style={/tikz/afit/.cd,#1,
		/tikz/.cd,
		inner xsep=\pgfkeysvalueof{/tikz/afit/left}+\pgfkeysvalueof{/tikz/afit/right},
		inner ysep=\pgfkeysvalueof{/tikz/afit/top}+\pgfkeysvalueof{/tikz/afit/bottom},
		xshift=-\pgfkeysvalueof{/tikz/afit/left}+\pgfkeysvalueof{/tikz/afit/right},
		yshift=-\pgfkeysvalueof{/tikz/afit/bottom}+\pgfkeysvalueof{/tikz/afit/top}},
	afit/.cd,left/.initial=2pt,right/.initial=2pt,bottom/.initial=2pt,top/.initial=2pt}

\newtheorem{definition}{Definition}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{assumption}{Assumption}
\newtheorem{claim}{Claim}

\usepackage{array}
\newcommand{\PreserveBackslash}[1]{\let\temp=\\#1\let\\=\temp}
\newcolumntype{C}[1]{>{\PreserveBackslash\centering}p{#1}}
\newcolumntype{R}[1]{>{\PreserveBackslash\raggedleft}p{#1}}
\newcolumntype{L}[1]{>{\PreserveBackslash\raggedright}p{#1}}

\usepackage{adjustbox}
\usepackage{calrsfs}
\usepackage{multirow}
\DeclareMathAlphabet{\pazocal}{OMS}{zplm}{m}{n}
\usepackage{soul}
\usepackage{subcaption}
\usepackage{appendix}
% \usepackage{lmodern}
% \renewcommand*\ttdefault{lmvtt}
% \renewcommand{\ttdefault}{cmtt}
% \renewcommand{\ttdefault}{pcr}
% \renewcommand*\ttdefault{txtt}
% \usepackage[charter]{mathdesign}
% \def\rmdefault{bch} % not scaled
% \def\ttdefault{blg}
\usepackage{inconsolata}

\newcommand{\given}{\,\vert\,}

%%%Nevin's macros

%\newcommand{\mR}{\mathbb{R}}
\newcommand{\mE}{{\mathbb{E}}}
%\newcommand{\sP}{{\cal P}}
%\newcommand{\sD}{{\cal D}}
\def\w{{w}}
\newcommand{\bc}{\begin{center}}
	\newcommand{\ec}{\end{center}}
\newcommand{\bit}{\begin{itemize}}
	\newcommand{\eit}{\end{itemize}}
\newcommand{\ben}{\begin{enumerate}}
	\newcommand{\een}{\end{enumerate}}
\newcommand{\bH}{{H}}
\def\x{{x}}
\def\hy{\hat{y}}
\def\X{{X}}
\def\tx{\tilde{x}}
\def\z{{z}}
\DeclareMathOperator\supp{supp}
\newcommand{\diff}{\mathop{}\!d}
\newcommand{\ci}{\mathrel{\text{\scalebox{1.07}{$\perp\mkern-10mu\perp$}}}}

\newcommand{\Xc}{\X^\mathrm{c}}
\newcommand{\Xn}{\X^\mathrm{n}}
\newcommand{\xc}{\x^\mathrm{c}}
\newcommand{\xn}{\x^\mathrm{n}}
\newcommand{\Xcalc}{\mathcal{X}_\mathrm{c}}
\newcommand{\Xcaln}{\mathcal{X}_\mathrm{n}}

%\newtheorem{theorem}{Theorem}
\newcommand{\myskip}[1]{ }


%\newcommand{\myorange}{}
\newcommand{\myorange}{\color{orange}}

%\newcommand{\mylater}[1]{ }
\newcommand{\mylater}[1]{\footnote{\myorange #1}}

%\newcommand{\myblue}{}
\newcommand{\myblue}{\color{blue}}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\usepackage{shorthands}
\usepackage{shortbold}
\input{math_commands}

% Select what to do with command \comment:  
% \newcommand{\comment}[1]{}  %comment not showed
\newcommand{\comment}[1]{\textcolor{gray}{[#1]}} %comment showed

\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}
\newcommand{\Wilds}{{Wilds}\xspace}
\newcommand{\suffix}{-WILDS}
\newcommand{\iWildCamName}{iWildCam2020}
\newcommand{\iWildCam}{{\iWildCamName{}\suffix{}}\xspace}
\newcommand{\ReLIC}{{ReLIC}\xspace}
\newcommand{\AugMix}{{AugMix}\xspace}
\newcommand{\CoRE}{{CoRE}\xspace}
\newcommand{\CamelyonName}{Camelyon17}
\newcommand{\CamelyonPrefix}{{\CamelyonName{}}\xspace}
\newcommand{\Camelyon}{{\CamelyonName{}\suffix{}}\xspace}

\makeatletter
\DeclarePairedDelimiterX{\infdivx}[2]{\big[}{\big]}{%
  #1\;\delimsize\|\;#2%
}
\makeatother

\title{Consistency Regularization for Domain Generalization\\with Logit Attribution Matching}
%Domain Generalization with Semantic Sharing Pairs}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors

% \author{Kaican Li\inst{1}\thanks{Equal contribution.} \and
% Kai Chen\inst{3}$^\star$ \and}
\author[1,2$*$\phantom{\thanks{Equal contribution, listed in alphabetical order.}}]{\href{mailto:<gaohan19@huawei.com>?Subject=LAM}{Han Gao}}
\author[2$*$]{\href{mailto:<klibf@cse.ust.hk>?Subject=LAM}{Kaican Li}}
\author[2$*$]{\href{mailto:<wxieai@cse.ust.hk>?Subject=LAM}{Weiyan Xie}}
\author[2]{\href{mailto:<zlinaz@connect.ust.hk>?Subject=LAM}{Zhi Lin}}
\author[1]{\href{mailto:<huang.yongxiang2@huawei.com>?Subject=LAM}{Yongxiang Huang}}
\author[1]{\\\href{mailto:<wangluning2@huawei.com>?Subject=LAM}{Luning Wang}}
\author[2]{\href{mailto:<cao@ust.hk>?Subject=LAM}{Caleb Chen Cao}}
\author[2$*$]{\href{mailto:<lzhang@cse.ust.hk>?Subject=LAM}{Nevin L. Zhang}}
% Add affiliations after the authors
\affil[1]{%
        Huawei Hong Kong AI Framework \& Data Technologies Lab
        % Hong Kong, China
}
\affil[2]{%
	The Hong Kong University of Science and Technology
        % Hong Kong, China
}


\begin{document}
\maketitle
% \vskip 20mm
\begin{abstract}
\vspace{-2mm}
Domain generalization (DG) is about training models that generalize well under domain shift. Previous research on DG has been conducted mostly in single-source or multi-source settings. In this paper, we consider a third, lesser-known setting where a training domain is endowed with a collection of pairs of examples that share the same semantic information.
Such semantic sharing (SS) pairs can be created via data augmentation and then utilized for consistency regularization (CR). We present a theory showing CR is conducive to DG and propose a novel CR method called Logit Attribution Matching (LAM). We conduct experiments on five DG benchmarks and four pretrained models with SS pairs created by both generic and targeted data augmentation methods. LAM outperforms representative single/multi-source DG methods and various CR methods that leverage SS pairs. The code and data of this project are available at \href{https://github.com/Gaohan123/LAM}{https://github.com/Gaohan123/LAM}.
\end{abstract}
	
\section{Introduction}


Deep learning models are successful under the independent and identically distributed (\iid) assumption that test data are drawn from the same distribution as training data.
However, models that generalize well in-distribution (ID) may be generalizing in unintended ways out-of-distribution (OOD)~\citep{szegedy2013intriguing, shah2020pitfalls, geirhos2020shortcut, di2022goal, yang2023glue}. Some image classifiers with great ID performance, in fact, rely on background and style cues to predict the class of foreground objects, leading to poor OOD performance~\citep{beery2018recognition, zech2018variable, xiao2020noise, geirhos2020shortcut}. Such reliance on spurious correlations hinders model performance under domain shift, affecting many real-world applications where the \iid assumption cannot be guaranteed~\citep{michaelis2019benchmarking, alcorn2019strike, koh2021wilds, ali2022assessing, li2022coda}.
	
\emph{Domain generalization} (DG) deals with the conundrum of generalizing under domain shift. Previous research on DG has mostly focused on the single-source and multi-source settings~\citep{zhou2022domain, wang2022generalizing}.
The single-source setting~\citep{volpi2018generalizing, hendrycks2019benchmarking} is the most general but also the most challenging setting, in which the domain of a datum is \emph{a priori} unknown. The lack of domain information makes it difficult to tell features that are invariant to domain shifts apart from those that are not. The multi-source setting~\citep{blanchard2011generalizing, muandet2013domain, ganin2016domain, arjovsky2019invariant}, on the other hand, assumes that such information is available to the degree that every datum is associated with a coarse domain label. Even so, a prohibitively large number of diverse domains may be required to solve real-world DG problems~\citep{wang2024lost}.


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=17cm]{figs1/uai2-ss-pairs-2-min.png}
\end{center}
\vspace{-0.4cm}
\caption{A semantic sharing (SS) pair involves an original training example and a transformed version of it obtained by data augmentation (DA). The examples in the first two pairs share the same semantic information for the ``giraffe'' class, and the examples in the last pair share the same semantic information for the ``dog'' class.  The augmented example in (a) is created manually via {Copy-Paste}~\citep{gao2023out}, the one in  (b) is created using a DA method called RandAugment~\citep{cubuk2020randaugment}, and the one in (c) is created using Stable Diffusion~\citep{rombach2022high} (see Appendix \ref{creation} for more details).  }
\label{fig:ss-pairs}
\end{figure*}

In this paper, we study a third, lesser-known setting where a training domain is associated with a collection of pairs of examples that share the same semantic information. Such {\em semantic sharing (SS) pairs} can be created effortlessly using existing data augmentation (DA) methods, as demonstrated by the examples in Figure \ref{fig:ss-pairs}.
Given a collection of SS pairs, the task is then to use them to reduce the dependence on spurious correlations.\footnote{\footnotesize At a high level of abstraction, 
this task is related to large language model (LLM) alignment where a collection of preference pairs is used to align an LLM to human intent~\citep{ouyang2022training,rafailov2023direct}.
{\em In both tasks, the pairs contain information about ideal model behavior that is absent from the training data.  }
In this sense, one might say that SS pairs are to domain generalization what preference pairs are to LLM alignment.}
There are several previous DG methods that exploit SS pairs for this purpose~\citep{hendrycks2020augmix, mitrovic2021representation, heinze2021conditional, mahajan2021domain, robey2021model, ouyang2021causality, wang2022out}. They leverage SS pairs via \emph{consistency regularization} (CR), a technique proposed in the semi-supervised learning literature to encourage similar predictions on similar inputs~\citep{bachman2014learning, zhang2020consistency, chen2020simple, caron2021emerging}.
One drawback they share is that they regard an SS pair $(\x, \tilde{\x})$ as {\em unlabeled} and assume $\x$ and $\tilde{\x}$ contain the same semantic information for {\em all classes}.
As illustrated in Figure \ref{fig:ss-pairs}, however,   an SS pair is often created to preserve the semantic information of {\em one particular class}, and is hence  {\em labeled}.
In this paper, we mainly study the use of labeled SS pairs for domain generalization.



We make three contributions in this paper:
1). {We present a theory to motivate the use of SS pairs for optimal domain generalization through causally invariant prediction};
2). We propose a novel method called {Logit Attribution Matching} (LAM) that leverages labeled SS pairs; 
3). We empirically demonstrate the advantages of LAM
 over representative single-source and multi-source DG methods, as well as various CR methods that leverage unlabeled SS pairs.

LAM consistently outperforms previous methods across multiple benchmarks. 
Take the {\iWildCam} dataset~\citep{koh2021wilds} as an example.
Empirical risk minimization (ERM) achieves an OOD (Macro F1) score of $30.2\%$ on an ImageNet-pretrained ResNet-50 model~\citep{he2016deep}.
The score increases to $33.8\%$ when the augmented examples 
created by RandAugment~\citep{cubuk2020randaugment} are simply added to the training set.  It further increases to $36.4\%$ when LAM is applied to the resulting SS pairs.  
For the augmented examples created by a more sophisticated data augmentation method~\citep{gao2023out}, the OOD score is $36.5\%$ when the augmented examples are simply added to the training set.
It further increases to $41.2\%$ when LAM is applied to the resulting SS pairs. In this case, the OOD score increases by $41.2 - 30.2 = 11.0$ percentage points in total, of which $41.2 - 36.5 = 4.7$ points are due to the exploitation of SS pairs.
{On CLIP ViT-L/14@336~\citep{radford2021learning}, LAM improves the state-of-the-art fine-tuning method from $47.1\%$ to $48.7\%$.}
It is hoped that our work can inspire the development of better SS pair creation methods so as to further boost OOD  performance of models.
%In this case, the {\em effective robustness } score~\citep{taori2020measuring} achieved by DA is  36.5 - 30.2 = 6.3\% , and that by LAM is 41.2 - 30.2 = 11\%.  ({\bf This calculation is incorrect}.  Taori:  effective robustness, i.e., the intervention should improve the accuracy under distribution shift beyond what is predicted by the linear fit)
%In contrast, a recent method called FLYP~\cite{Goyal_2023_CVPR}
%achieved 46.0 - 41.4 = 5.4\% using a CLIP-pretrained ViT-L/14 model.  



\section{Related Work}
\label{sec:related work}

%\subsection{Domain Generalization}
%\label{sec:related DG}
\textbf{Domain generalization (DG)} is a fundamental problem in machine learning and has attracted much attention in recent years.
A large number of methods have been proposed. In this section, we briefly review several representative methods that are frequently used as baselines in the literature.  They are also used in our experiments as baselines.
 
Most DG methods assume access to multiple training domains~\citep{blanchard2011generalizing, muandet2013domain}.
Among those {\em multi-source} methods, Group Distributionally Robust
Optimization~(GDRO)~\citep{sagawa2019distributionally}
seeks to minimize the worst-case risk across the training domains.
Invariant Risk Minimization~(IRM)~\citep{arjovsky2019invariant} regularizes ERM with a penalty that enforces cross-domain optimality on the classifier.
Variance Risk Extrapolation (V-REx)~\citep{krueger2020outofdistribution}
penalizes the variance of risks in different training domains.
Domain-Adversarial Neural Networks (DANN)~\citep{ganin2016domain} aims at mapping inputs from each training domain to an invariant distribution in the feature space from which the original domains are indistinguishable.

{\em Single-source DG} does not assume access to multiple training domains~\citep{volpi2018generalizing, hendrycks2019benchmarking}.
One of the main approaches to single-source DG is to discover predictive features that are more sophisticated than simple cues spuriously correlated with labels.
Representation Self-Challenging (RSC)~\citep{huang2020self} and Spectral Decoupling (SD)~\citep{pezeshki2021gradient} are two prominent methods in this direction.
SD suppresses strong dependencies of output on dominant features by regularizing the logits. RSC aims to achieve the same goal in a heuristic manner. 
 %At each iteration of training, it mutes the feature units associated with the highest gradients, such that the network is forced to predict the labels through other less salient features.
 Another approach to single-source DG is to simply add
 augmented examples to the training set~\citep{zhang2017mixup, cubuk2020randaugment,gao2023out}. This approach has been shown to improve OOD performance in many cases, because
 data augmentation exposes a model to more feature variations during training and thereby enhances its capability in dealing with novel domains.

	
	
%\subsection{Consistency Regularization}
\paragraph{Consistency regularization (CR) and semantic sharing (SS) pair creation.}
CR encourages a model to make similar predictions on similar inputs.  The idea originated from the semi-supervised learning literature~\citep{bachman2014learning, sohn2020fixmatch}. It is also used in contrastive learning~\citep{chen2020simple} and 
non-contrastive self-supervised learning~\citep{caron2021emerging}.
{In the context of DG, \citet{wang2022toward} conducted a systematic evaluation of various pre-existing CR methods and found that logit matching is most effective with $L^2$-norm (among $L^1$-norm, cosine similarity, etc.).
In addition to logit matching with $L^2$-norm, we study a few other options including novel ones such as target-logit matching and LAM which will be discussed in Section 
\ref{sec:CR-unlabeled}}.
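As a rough illustration (the notation here is ours and not necessarily that of the cited work), write $f_\theta(\x)$ for the vector of logits that a model assigns to an input $\x$. Logit matching with the (squared) $L^2$-norm then adds to the training objective a penalty of the form
\begin{equation*}
    \lambda \, \bigl\| f_\theta(\x) - f_\theta(\tilde{\x}) \bigr\|_2^2
\end{equation*}
for each SS pair $(\x, \tilde{\x})$, where $\lambda > 0$ is a hypothetical weighting hyperparameter. The penalty vanishes exactly when the two inputs of a pair receive identical logits.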
	%\citep{hendrycks2020augmix,mitrovic2021representation,heinze2021conditional,mahajan2021domain,robey2021model,wang2022out}.
	%They differ in their ways to create and use SS pairs. 
	
	
	\begin{figure}[t]
		% \begin{center}
		% 	\includegraphics[width=8.5cm]{figs/fig.model-result-1.png}
		% \end{center}
            \centering
            \adjustbox{min width=0.6\linewidth}{
            \begin{tikzpicture}
                \definecolor{mycolor1}{HTML}{FF1F5B}
                \definecolor{mycolor2}{HTML}{00CD6C}
                \definecolor{mycolor3}{HTML}{009ADE}
                \begin{scope}[every node/.style={minimum size=2em}]
                    \node [font=\large] (xn0) at (0, 0) {$\Xn$};
                    \node [font=\large] (xc0) at (1.6, 0) {$\Xc$};
                    \node [font=\large] (y)  at (3.2, 0) {$Y$};
                    \node [font=\large] (x) at (0.8, -1.3) {$\X$};
                    % Text Node
                    \draw (0.0, 1.0) node [anchor=north west][inner sep=0.75pt]  [font=\small] [align=left] {\color{mycolor1}$P(\Xc, \Xn)$};
                    \draw (1.3, -1) node [anchor=north west][inner sep=0.75pt]  [font=\small] [align=left] {\color{mycolor2}$P^{*}(X|\Xc, \Xn)$};
                    \draw (2.4, 1.0) node [anchor=north west][inner sep=0.75pt]  [font=\small] [align=left] {\color{mycolor3}$P^{*}(Y|\Xc)$};
                \end{scope}
                \begin{scope}[]
                    \path [dashed,draw=mycolor1] (xn0) edge node {} (xc0);
                    \path [->,draw=mycolor3,>={Stealth[mycolor3]}] (xc0) edge node {} (y);
                    \path [->,draw=mycolor2,>={Stealth[mycolor2]}] (xn0) edge node {} (x);
                    \path [->,draw=mycolor2,>={Stealth[mycolor2]}] (xc0) edge node {} (x);
                \end{scope}
                \begin{scope}[on background layer]
                    \draw[rounded corners=10, fill=black, fill opacity=0.04, draw=none]($(xn0) + (-0.6, -0.25)$) -- ($(xn0) + (-0.7, 0.45)$) -- ($(y) + (0.5, 0.45)$)  -- ($(y) + (0.5, -0.4)$) -- ($(xc0) + (0.5, -0.4)$) -- ($(x) + (0.4, -0.4)$) -- ($(x) + (-0.4, -0.4)$) -- cycle;
                \end{scope}
            \end{tikzpicture}}
		\vspace{0.4cm}
		\caption{\textbf{Causal latent decomposition (CLD) model.} The input of a training example $\X$ is generated from two latent variables
	$\Xc$ and $\Xn$ which may be statistically correlated due to confounders or direct mechanisms between them. The ground-truth label $Y$ is generated from only $\Xc$.
	The mechanisms that generate $\X$ and $Y$ are assumed to be invariant across domains. The corresponding conditional distributions are denoted as $P^*(\X|\Xc, \Xn)$ and $P^*(Y|\Xc)$.
	The joint distribution $P(\Xn, \Xc)$ of the two latent variables may change across domains.
        {We assume that $\Xc$ always $d$-separates $Y$ from the other variables.}
    }
		\label{fig.model-result}
  \vspace{-0.2cm}
	\end{figure}
	
	To apply CR in the context of DG, we need semantic sharing (SS) pairs.
	A straightforward way to create SS pairs is to use generic data augmentation (DA) techniques like CutMix~\citep{yun2019cutmix} and RandAugment~\citep{cubuk2020randaugment}.
    {Previous CR methods primarily adopted generic DA techniques~\citep{hendrycks2020augmix, xie2020unsupervised, wang2022toward, chen2022contrastive, jing2023order, berezovskiy2023weight}.}
    SS pairs can also be created/obtained in ways other than conventional DA.
    For example, \citet{gao2023out} explored targeted data augmentation (Targeted DA) which utilizes task-specific domain knowledge to augment data.
    % An input image $\x$ and an augmentation $\tx$ form an SS pair.
    \citet{heinze2021conditional} paired up photos of the same person when analyzing the CelebA dataset~\citep{liu2015deep}.
    For medical images, \citet{ouyang2022causality} created pairs by performing image transformations to simulate different possible acquisition processes.
    Furthermore, in the case of multiple source domains, SS pairs can be learned. 
    \citet{robey2021model} and \citet{wang2022out} build image-to-image translation networks between domains and use them to create pairs. \citet{mahajan2021domain} propose an iterative algorithm that uses contrastive learning to map images to a latent space, and then matches up images from different domains that have the same class label and are close to each other in the latent space.
	


\section{A Causal Theory of Domain Generalization}
\label{sec:causal-model}

In this section, we present a causal theory of domain generalization, which will be used in the next section to motivate methods for leveraging SS pairs.
In the context of DG, a {\em domain} $d$ is defined by a distribution $P(\X, Y)$ over the space of input-label pairs $(\X, Y)$.
We assume the pairs are generated by the causal model shown in Figure \ref{fig.model-result}.

The model first appeared in~\citet{tenenbaum1996separating}, where it is called the {\em style and content decomposition (SCD) model}, and
$\Xc$ and $\Xn$ are called the  {\em content} and {\em style} variables
respectively.
Similar models appeared recently in a number of papers under different terminologies. {\em  The variable $\Xc$ denotes the essential information in an image $\X$ that a human relies on to assign a label $Y$ to the image.}
It is hence said to represent
{causal factors}~\citep{mahajan2021domain,lv2022causality, ye2022ood}, {intended  factors}~\citep{geirhos2020shortcut}, {semantic factors}~\citep{liu2021learning},
{content factors}~\citep{mitrovic2021representation}, and
{core factors}~\citep{heinze2021conditional}.  In contrast,
{\em the variable $\Xn$ denotes the other aspects of $\X$ that are not essential to label assignment.} It
is hence said to represent
{non-causal factors}, {non-intended factors}, {variation factors},  {style factors}, and
{non-core factors}.
As the relationship between $\Xc$ and $Y$ does not change across domains, $\Xc$ is sometimes said to represent
{stable features}~\citep{zhang2021deep},
domain-independent factors~\citep{ouyang2022causality},
and invariant features~\citep{arjovsky2019invariant,ahuja2021invariance}.
In contrast, $\Xn$ is said to represent
non-stable features,
domain-dependent factors,
and spurious features.

The term ``style'' in the SCD model should be understood in a broad sense. In addition to image style, it also includes factors such as background, context, object pose, and so on. To avoid confusion, we follow~\citet{mahajan2021domain,lv2022causality} and
refer to
$\Xc$ and $\Xn$ as the {\em causal and non-causal factors} respectively,
and rename the SCD model as the {\em causal latent decomposition (CLD) model}.



\begin{figure}[t]
\centering
\adjustbox{max width=0.79\linewidth}{
\begin{tikzpicture}[x=0.6pt,y=0.6pt,yscale=-1,xscale=1]
%uncomment if require: \path (0,300); %set diagram left start at 0, and has height of 300

%Shape: Rectangle [id:dp06784414262282845] 
\draw  [fill={rgb, 255:red, 240; green, 236; blue, 236 }  ,fill opacity=1 ] (226,55) -- (382,55) -- (382,211) -- (226,211) -- cycle ;
%Shape: Polygon Curved [id:ds06712316929742368] 
\draw  [draw opacity=0][fill={rgb, 255:red, 74; green, 144; blue, 226 }  ,fill opacity=0.42 ] (307,116) .. controls (332,103) and (380,110) .. (358,116) .. controls (336,122) and (279,150) .. (272,169) .. controls (265,188) and (233,184) .. (260,153) .. controls (287,122) and (282,129) .. (307,116) -- cycle ;
%Shape: Polygon Curved [id:ds3642234239469342] 
\draw  [draw opacity=0][fill={rgb, 255:red, 245; green, 166; blue, 35 }  ,fill opacity=1 ] (268,117) .. controls (205,105) and (221,66) .. (293,105) .. controls (365,144) and (355,151) .. (370,165) .. controls (385,179) and (391,200) .. (365,187) .. controls (339,174) and (331,129) .. (268,117) -- cycle ;
\begin{scope}[>={Stealth[black]},
    every edge/.style={draw=black}]
    %Curve Lines [id:da5201368743735251] 
    \draw[->, to path={-| (\tikztotarget)}]  (440,63) .. controls (391.7,58.13) and (306.39,82.72) .. (287.33,96) ;
    % \draw [shift={(286,97)}, rotate = 320.91] [color={rgb, 255:red, 0; green, 0; blue, 0 }  ][line width=0.75]    (10.93,-3.29) .. controls (6.95,-1.4) and (3.31,-0.3) .. (0,0) .. controls (3.31,0.3) and (6.95,1.4) .. (10.93,3.29)   ;
    %Curve Lines [id:da6892335341578932] 
    \draw[->, to path={-| (\tikztotarget)}]   (448,165) .. controls (397.4,227.63) and (301.92,192.84) .. (275,174) ;
    % \draw [shift={(277,184)}, rotate = 19.29] [color={rgb, 255:red, 0; green, 0; blue, 0 }  ][line width=0.75]    (10.93,-3.29) .. controls (6.95,-1.4) and (3.31,-0.3) .. (0,0) .. controls (3.31,0.3) and (6.95,1.4) .. (10.93,3.29)   ;
\end{scope}
%Shape: Brace [id:dp9829699377024297] 
\draw   (381,50) .. controls (381.03,45.33) and (378.72,42.98) .. (374.05,42.95) -- (315.55,42.57) .. controls (308.88,42.52) and (305.56,40.17) .. (305.59,35.5) .. controls (305.56,40.17) and (302.22,42.48) .. (295.55,42.43)(298.55,42.45) -- (237.05,42.05) .. controls (232.38,42.02) and (230.03,44.33) .. (230,49) ;
%Shape: Rectangle [id:dp2958469323892545] 
\draw   (177,55) -- (431,55) -- (431,211) -- (177,211) -- cycle ;
%Shape: Brace [id:dp86080669122313] 
\draw   (249,217) .. controls (249,221.67) and (251.33,224) .. (256,224) -- (295.5,224) .. controls (302.17,224) and (305.5,226.33) .. (305.5,231) .. controls (305.5,226.33) and (308.83,224) .. (315.5,224)(312.5,224) -- (355,224) .. controls (359.67,224) and (362,221.67) .. (362,217) ;
%Straight Lines [id:da19959265387236447] 
\draw  [dash pattern={on 4.5pt off 4.5pt}]  (249,177) -- (249,211) ;
%Straight Lines [id:da009782669617053319] 
\draw  [dash pattern={on 4.5pt off 4.5pt}]  (363,113) -- (363,211) ;

% Text Node
\draw (128,120) node [anchor=north west][inner sep=0.75pt]  [font=\Large] [align=left] {$\displaystyle \mathcal{X}^\mathrm{n}$};
% Text Node
\draw (190,225) node [anchor=north west][inner sep=0.75pt]  [font=\Large] [align=left] {$\displaystyle \mathcal{X}^\mathrm{c}$};
% Text Node
\draw (240,5) node [anchor=north west][inner sep=0.75pt]  [font=\Large] [align=left] {$\supp[ P^\mathrm{s}( \Xc)]$};
% Text Node
\draw (442,56) node [anchor=north west][inner sep=0.75pt]  [font=\Large] [align=left] {$\supp[ P^\mathrm{s}( \Xn ,\ \Xc)]$};
% Text Node
\draw (444,140) node [anchor=north west][inner sep=0.75pt]  [font=\Large] [align=left] {$\supp[ P^\mathrm{t}( \Xn ,\ \Xc)]$};
% Text Node
\draw (240,235) node [anchor=north west][inner sep=0.75pt]  [font=\Large] [align=left] {$\supp[ P^\mathrm{t}( \Xc)]$};
\end{tikzpicture}}
\caption{\small An illustration of conditions for optimal DG under the CLD model.
Training examples $\x$ are sampled from
the latent space, $\mathcal{X}^\mathrm{c} \times \mathcal{X}^\mathrm{n}$, which we depict as a 2-D box.  A prediction model is causally invariant if it makes the same prediction for examples sampled from the same ``vertical line'' in the latent space. If such a model also minimizes the cross-entropy loss of a source domain, then it makes optimal predictions on all examples $\tilde{\x}$ sampled from $\supp[P^\mathrm{s}(\Xc)]\times \mathcal{X}^\mathrm{n}$ (the inner rectangle), not only those from 
$\supp[P^\mathrm{s}(\Xc, \Xn)]$.
This enables optimal generalization
to any target domain $P^\mathrm{t}$ such that $\supp[P^\mathrm{t}(\Xc)] \subseteq \supp[P^\mathrm{s}(\Xc)]$.  }
    \label{fig.model-conditions}
% \vspace{-0.1cm}
\end{figure}

To ground the CLD model, we need to specify three distributions: $P(\Xc, \Xn)$, $P^*(\X|\Xc, \Xn)$ and $P^*(Y|\Xc)$.  Together, the three distributions define a joint distribution over the four variables:
\begingroup
    \small%
    \begin{equation*}
        P(\Xc, \Xn, \X, Y) = P(\Xc, \Xn)P^*(\X|\Xc, \Xn)P^*(Y|\Xc).
    \end{equation*}
\endgroup
This joint distribution defines a domain in the CLD framework.  We refer to the collection of all such domains for some fixed $P^*(\X|\Xc, \Xn)$ and $P^*(Y|\Xc)$ as a {\em CLD family}.
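For illustration, suppose the latent variables are discrete (an assumption made only for this example; the sums become integrals otherwise). The observable distribution $P(\X, Y)$ is then obtained by marginalizing out the latent variables:
\begingroup
    \small%
    \begin{equation*}
        P(\X, Y) = \sum_{\xc,\, \xn} P(\xc, \xn)\, P^*(\X|\xc, \xn)\, P^*(Y|\xc).
    \end{equation*}
\endgroup
Two domains in the same CLD family therefore differ only through the first factor, $P(\Xc, \Xn)$.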
	
	
	Let $\mathcal{X}^\mathrm{c}$ and $\mathcal{X}^\mathrm{n}$  be the sets of all possible values of
	the latent variables $\Xc$ and $\Xn$ respectively.
	Consider an example $\x$ generated by  $P^{*}(\X|\Xc, \Xn)$  from a pair of values~\footnote{\footnotesize We use upper case letters to denote variables and lower case letters to denote their values. We use $P$ with variables, e.g., $P(\Xc)$, to denote a distribution; and $P$ with variable values, e.g., $P(\Xc=\xc)$, to denote a probability value. We may omit the variables if the context is clear, e.g., we may write $P(\Xc=\xc)$ as $P(\xc)$.} $(\xc, \xn) \in \mathcal{X}^\mathrm{c} \times \mathcal{X}^\mathrm{n}$. Let $\tilde{\x}$ be another example sampled from the same $\xc$ and a different $\tilde{\x}^\mathrm{n}$.  %Formally,
	% \begin{eqnarray}
	% 	\label{eq.2examples}
	% 	\x \sim P^{*}(\X|\Xc=\xc, \xn),\quad
	% 	\tilde{\x} \sim P^{*}(\X|\Xc=\xc, \tilde{\x}^\mathrm{n}).
	% \end{eqnarray}
	%	\vspace{-5mm}
	The two examples ${\x}$ and $\tilde{\x}$ share the same semantic content and hence should be assigned to the same class.
	In this sense, $\x$ and $\tilde{\x}$  make up
	a  {\em semantic sharing (SS) pair}.
	Let $\hat{P}_{\theta}(\hat{Y}|\X)$ be a {\em prediction model} with parameters $\theta$.
	It is said to be  {\em causally invariant} if
	\begin{equation}
		\label{eq.c-invariant}
		\hat{P}_{\theta}(\hat{Y}|\X=\x) = \hat{P}_{\theta}(\hat{Y}|\X=\tilde{\x}),
	\end{equation}
	for all SS pairs  $(\x, \tilde{\x})$. 
	In other words, the prediction output does not change in response to variations in the non-causal factors $\Xn$ as long as the causal factors $\Xc$ remain fixed.
        {Such causal invariance is a key condition for optimal DG.}
	\begin{theorem}
		\label{theo.ID-optimal}
		{\bf (Conditions for Optimal DG)}
		Let  $\hat{P}_{\theta}$ be a prediction model for a CLD family {such that different $\xc$ almost always generate different $\x$}, and let $P^\mathrm{s}$ and $P^\mathrm{t}$ be a source and a target domain (from the family) such that $\supp[P^\mathrm{t}(\Xc)] \subseteq \supp[P^\mathrm{s}(\Xc)]$.
        Suppose:
  \begin{itemize}
		\item[1).] $\hat{P}_{\theta}$ minimizes the  in-distribution (ID) cross-entropy loss
		$\ell_{\mathrm{s}}(\hat{P}_{\theta})  = \mE_{(\x, y) \sim P^\mathrm{s}} [- \log \hat{P}_{\theta}(\hat{Y}=y|\x) ]$;  
		\item[2).] $\hat{P}_{\theta}$ is causally invariant. 
  \end{itemize}
		Then, the prediction model
		$\hat{P}_{\theta}$
		also minimizes the out-of-distribution (OOD) cross-entropy loss:
		\[
		\ell_{\mathrm{t}}(\hat{P}_{\theta})= \mE_{(\x, y) \sim P^\mathrm{t}} [- \log \hat{P}_{\theta}(\hat{Y}=y|\x) ].
		\]
		In other words, it generalizes optimally to the target domain.
	\end{theorem}
	The proof of this theorem can be found in Appendix \ref{proof}.  Closely related theoretical results~\citep{peters2016causal, arjovsky2019invariant, mahajan2021domain} are discussed in Appendix \ref{theory-related}.
 % Note that the first condition requires
	% $\ell_{\mathrm{s}}(\hat{P}_{\theta})$ be minimized not only with respect to the parameters
	% of the prediction model  $\hat{P}_{\theta}$, but also its architecture.  
    The support
	$\supp[P(\Xc)] = \{\xc \in \mathcal{X}^\mathrm{c} \mid P(\xc)>0\}$ consists of all causal factors that appear in a domain $P$. 
    {The assumption on the supports of $P^\mathrm{s}$ and $P^\mathrm{t}$ can be relaxed if we consider approximately optimal DG. We keep the simpler assumption here because such a relaxation is not pertinent to the focus of this paper.
    More importantly, the second condition on $\hat{P}_\theta$ connects consistency regularization (CR) with DG.}

%{Acknowledge the unverifiable nature of the causal-invariant condition here?}

The intuition behind Theorem \ref{theo.ID-optimal} is illustrated in Figure~\ref{fig.model-conditions}.
{In short, Theorem \ref{theo.ID-optimal} articulates a set of sufficient conditions for optimal DG. While the causal invariance condition is difficult to verify or fully attain in practice, it can still guide the development of practical DG algorithms. We next discuss CR methods that can bring the model closer to meeting the causal invariance condition.}
 
\section{Consistency Regularization for Domain Generalization}
\label{sec:CR}

%The second condition (the causal-invariant condition) in it connects the use of SS pairs for DG. 

{Intuitively, one can make a model more causally invariant by encouraging the model to yield invariant predictions for SS pairs sharing the same $\Xc$.}
We therefore address the following problem in this paper:
\begin{quote}
\leftskip=-0.5em
\rightskip=-0.5em
{\em 
Given a source domain $P^\mathrm{s}$ from a CLD family and a set of labeled SS pairs 
$\{(\x_i, \tilde{\x}_i; y_i)\}_{i=1}^N$,
learn a prediction model   $\hat{P}_{\theta}(\hat{Y}|\X)$ that performs well in any target domain {$P^\mathrm{t}$} from the same CLD family.
}
\end{quote}
Recall that a CLD family consists of all the domains defined by the causal model in Figure \ref{fig.model-result} with fixed 
$P^*(\X|\Xc, \Xn)$ and $P^*(Y|\Xc)$.  


\subsection{CR with Unlabeled SS Pairs}
\label{sec:CR-unlabeled}

% Note that the prediction output $\hat{P}_{\theta}(\hat{Y}|\X=\x)$ refers to the probability distribution over all classes, and hence the SS pairs here are unlabeled.

Let us first consider the case where we have a set of \emph{unlabeled} SS pairs $\{(\x_i, \tilde{\x}_i)\}_{i=1}^N$.
{The distinction between labeled and unlabeled SS pairs is whether the semantic information is invariant for just one particular class or for all classes,
% class labels are required and utilized by the consistency regularizer, 
\emph{not} whether the original example $\x_i$ is labeled.}
Unlabeled SS pairs contain stronger information than labeled SS pairs: if two examples $\x_i$ and $\tilde{\x}_i$ contain the same semantic information for all classes, then they contain the same semantic information for any particular class.

With unlabeled SS pairs, the two conditions of Theorem~\ref{theo.ID-optimal} can be approximately satisfied by solving the following constrained optimization problem:
\begingroup
    % \small%
    \begin{align*}
        \min_{\theta} \hspace{0.8em} &\mE_{(\x, y) \sim P^\mathrm{s}} [- \log \hat{P}_{\theta}(\hat{Y}=y|\x) ] \\
        \mathrm{subject\ to} \hspace{0.8em} &\hat{P}_{\theta}(\hat{Y}|\X=\x_i) = \hat{P}_{\theta}(\hat{Y}|\X=\tilde{\x}_i), \hspace{0.2cm}
        i \in [N].
    \end{align*}
\endgroup
 Of course, how well the two conditions are actually satisfied depends on how representative the unlabeled SS pairs we have are of all possible SS pairs. 
 
	If we turn the equality constraints into a {\em consistency regularization (CR)} term, the problem becomes:
	%    \begin{small}
		%	\begin{eqnarray}
			%		\label{eq.c-regularization}
			%		\min_{\theta}  \hspace{0.2cm}   \mE_{(\x, y) \sim P^\mathrm{s}} [- \log \hat{P}_{\theta}(\hat{Y}=y|\X=\x) ] + 
			%		\frac{\lambda}{K} \sum_{k=1}^K r_{\theta}(\x_i, \tilde{\x}_i),
			%	\end{eqnarray}	
		%    \end{small}
	%\begin{small}
	\begin{equation*}
		\label{eq.c-regularization}
		\min_{\theta} \hspace{0.4em} \mathbb{E}_{(\x, y) \sim P^\mathrm{s}} [- \log \hat{P}_{\theta}(\hat{Y}=y|\x) ] + \lambda \mathbb{E}_{i} [r_\theta(\x_i, \tilde{\x}_i)],
	\end{equation*}
	%\end{small}
	where $\lambda$ is a balancing parameter and the expectation of $r_\theta(\x_i, \tilde{\x}_i)$ over the SS pairs is a regularization term that relaxes the corresponding equality constraints.
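
	As a rough sketch of how this objective might be estimated in practice, suppose that $\mathcal{B}$ is a minibatch of labeled source examples and $\mathcal{I} \subseteq [N]$ is a minibatch of SS-pair indices. Both symbols are introduced here only for illustration and do not describe the exact training procedure. The two expectations can then be replaced by sample averages:
	\begin{equation*}
		\frac{1}{|\mathcal{B}|}\sum_{(\x, y) \in \mathcal{B}} \big[- \log \hat{P}_{\theta}(\hat{Y}=y|\x) \big] + \frac{\lambda}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} r_\theta(\x_i, \tilde{\x}_i).
	\end{equation*}
	Setting $\lambda = 0$ recovers plain ERM, while larger values of $\lambda$ push the solution toward satisfying the equality constraints of the constrained problem.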
 


Some notation is needed in order to discuss specific choices for $r_\theta$.
	Suppose  $\hat{P}_{\theta}$ consists of a feature extractor $f_{\phi}$ with parameters $\phi$  and a linear classification head  $g_{\w}$ with parameters
	$\w$.  Hence, $\theta = (\phi, \w)$.
	For an input $\x$,  let $f_{\phi}^u(\x)$ be the component of the feature vector $f_{\phi}(\x)$  for a feature unit $u$. Let $w_{uy}$ be the
	weight between a feature unit $u$ and the output unit for a class $y$.
	The logit for class $y$ is
 \[z_{\theta}^y(\x) = \sum_u  f_{\phi}^u(\x)w_{uy}, \]	
% \footnote{\scriptsize Assume the bias is represented by a dummy unit.}
where the summation is over all feature units $u$ and the bias is omitted.


	For each unlabeled SS pair $(\x_i, \tilde{\x}_i)$,  the CR term 
	$r_\theta(\x_i, \tilde{\x}_i)$ can be defined in several ways: 
	\begin{align*}
	    r_\theta^{\texttt{KL}}(\x_i, \tilde{\x}_i) &=  {D}_{\texttt{KL}}\infdivx{\hat{P}_{\theta}(\hat{Y}|\X=\x_i)}{\hat{P}_{\theta}(\hat{Y}|\X=\tx_i)},\\
 r_\theta^{\texttt{JS}}(\x_i, \tilde{\x}_i) &=  {D}_{\texttt{JS}}\infdivx{\hat{P}_{\theta}(\hat{Y}|\X=\x_i)}{\hat{P}_{\theta}(\hat{Y}|\X=\tx_i)},\\
 r_\theta^{\texttt{LM}}(\x_i, \tilde{\x}_i) &=  \sum\nolimits_{y} \big[z_{\theta}^y(\x_i) - z_{\theta}^y(\tx_i)\big]^2, \\ 
		r_\theta^{\texttt{FM}}(\x_i, \tilde{\x}_i) &=   \sum\nolimits_{u} \big[f_{\phi}^u(\x_i) - f_{\phi}^u(\tx_i)\big]^2.
	\end{align*}
	The first two terms aim to match the output probability distributions of $\x_i$ and $\tilde{\x}_i$ by minimizing either the 
 KL or JS divergence between them.  The third term aims to match their logit vectors, and the fourth term aims to match their feature vectors.  These four terms are used in the previous methods \ReLIC~\citep{mitrovic2021representation},
	\AugMix~\citep{hendrycks2020augmix}, \CoRE~\citep{heinze2021conditional}, and
	MatchDG~\citep{mahajan2021domain}, respectively.
	Note that while we focus on pairs for simplicity, logit and feature matching can also be extended to the case of multiple examples that share the same semantic contents. To achieve this, we can simply replace the sum of squared differences with the sum of variances. This is done in \CoRE and MatchDG.
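
	To make the variance-based extension concrete, the following sketch spells out the feature-matching case. The notation is introduced here only for illustration: a group of $K$ examples $\x_i^{(1)}, \dots, \x_i^{(K)}$ that share the same semantic contents, with the population (divide-by-$K$) variance taken per feature unit. The exact normalization used in \CoRE and MatchDG may differ.
	\begin{align*}
		\bar{f}_{\phi}^u &= \frac{1}{K}\sum_{k=1}^K f_{\phi}^u(\x_i^{(k)}), \\
		r_\theta^{\texttt{FM-var}} &= \sum\nolimits_{u} \frac{1}{K}\sum_{k=1}^K \big[f_{\phi}^u(\x_i^{(k)}) - \bar{f}_{\phi}^u\big]^2.
	\end{align*}
	For $K=2$ with $\x_i^{(1)} = \x_i$ and $\x_i^{(2)} = \tx_i$, each deviation from the mean equals $\pm\frac{1}{2}\big[f_{\phi}^u(\x_i) - f_{\phi}^u(\tx_i)\big]$, so the expression reduces to $\frac{1}{4}\, r_\theta^{\texttt{FM}}(\x_i, \tilde{\x}_i)$, i.e., the pairwise term up to a constant factor.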
	
 \subsection{CR with Labeled SS Pairs}
	\label{sec:CR-labeled}
	
	Now consider the case where we have a set of labeled SS pairs
	$\{(\x_i, \tilde{\x}_i; y_i)\}_{i=1}^N$.
	Here, each pair $\x_i$ and $\tilde{\x}_i$ share the same semantic information only for the class $y_i$. It is no longer justifiable to match all the features,  logits or probabilities of all classes. In the following, we propose three methods for leveraging labeled SS pairs.
 
 First, we can match the probabilities or logits of the target class $y_i$ only, leading to 
 what we call 
  \emph{target probability matching (TPM)} and {\em target logit matching (TLM)}:
\begin{align*}
	   r_\theta^{\texttt{TPM}}(\x_i, \tx_i; y_i) &= \big[\hat{P}_{\theta}(\hat{Y}=y_i|\x_i) - \hat{P}_{\theta}(\hat{Y}=y_i|\tx_i)\big]^2, \\
    r_\theta^{\texttt{TLM}}(\x_i, \tx_i; y_i) &= \big[z_{\theta}^{y_i}(\x_i) - z_{\theta}^{y_i}(\tx_i)\big]^2.
\end{align*}
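To see how the two terms differ, consider the following sketch, which assumes that the output probabilities are obtained from the logits through a softmax (an assumption made here for illustration):
\begin{equation*}
	\hat{P}_{\theta}(\hat{Y}=y_i|\x) = \frac{\exp\big(z_{\theta}^{y_i}(\x)\big)}{\sum_{y'} \exp\big(z_{\theta}^{y'}(\x)\big)}.
\end{equation*}
Under this assumption, the target probability depends on the logits of all classes through the normalizer. Hence $r_\theta^{\texttt{TLM}} = 0$ does not imply $r_\theta^{\texttt{TPM}} = 0$: the target logits of $\x_i$ and $\tx_i$ may agree while their non-target logits differ. Conversely, $r_\theta^{\texttt{TPM}}$ can vanish even when the target logits differ, provided that the ratio of the two normalizers equals the ratio of the two exponentiated target logits.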

To introduce the third method, note that $f^u_{\phi}(\x_i)w_{uy_i}$ is the contribution to the logit  $z_{\theta}^{y_i}(\x_i)$ of $y_i$ from the feature unit $u$. We can match the logit contributions $f^u_{\phi}(\x_i)w_{uy_i}$ and $f^u_{\phi}(\tx_i)w_{uy_i}$ from all feature units $u$ to $y_i$.  This gives rise to
  {\em logit attribution matching (LAM)}:
  \begin{equation*}
			\label{eq:lam}
			r_\theta^{\texttt{LAM}}(\x_i, \tilde{\x}_i; y_i)=
              \sum_u\big[ f_{\phi}^{u}(\x_i)w_{uy_i} -  f_{\phi}^u(\tx_i)w_{uy_i}\big]^2.
    \end{equation*}
LAM is finer-grained than TLM.
Small $r_\theta^{\texttt{LAM}}$ implies small $r_\theta^{\texttt{TLM}}$, but not vice versa:
{\begin{equation*}
    \begin{split}
        r_\theta^{\texttt{LAM}}(\x_i, \tilde{\x}_i; y_i)
        &\geq \frac1m\Big[\sum_u f^u_\phi(\x_i){{w_{uy_i}}}-\sum_u f^u_\phi(\tx_i){{w_{uy_i}}}\Big]^2 \\
        &= \frac1m r_\theta^{\texttt{TLM}}(\x_i, \tx_i; y_i),
    \end{split}
\end{equation*}
where $m$ is the number of feature units.}
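The inequality follows from the Cauchy--Schwarz inequality. As a short sketch, write $a_u = f^u_\phi(\x_i)w_{uy_i} - f^u_\phi(\tx_i)w_{uy_i}$ for the difference in the logit contribution from unit $u$ (the symbol $a_u$ is introduced here only for this derivation). Then
\begin{equation*}
	\Big[\sum_u a_u\Big]^2 = \Big[\sum_u 1 \cdot a_u\Big]^2 \leq \Big[\sum_u 1^2\Big]\Big[\sum_u a_u^2\Big] = m \, r_\theta^{\texttt{LAM}}(\x_i, \tilde{\x}_i; y_i),
\end{equation*}
and the left-hand side equals $r_\theta^{\texttt{TLM}}(\x_i, \tx_i; y_i)$. The converse fails: with $m=2$ and $(a_1, a_2) = (1, -1)$, we have $r_\theta^{\texttt{TLM}} = 0$ while $r_\theta^{\texttt{LAM}} = 2$.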
Also, note that
\[r_\theta^{\texttt{LAM}}(\x_i, \tilde{\x}_i; y_i)=\sum_{u} 
			\big[f_{\phi}^{u}(\x_i) - f_{\phi}^u(\tx_i)\big]^2w^2_{uy_i}.\]
Hence, LAM  exerts  two complementary regularization forces, one on  $g_{\w}$ and the other on $f_{\phi}$:
\bit
	\item[1).] It encourages the classification head $g_{\w}$ to put large weights $|w_{uy_i}|$ on the  feature units $u$ where the  values of $\x_i$ and $\tx_i$ are similar, i.e.,  $f_{\phi}^u(\x_i) \approx f_{\phi}^u(\tx_i)$. In other words,  {\em it makes $g_{\w}$ rely on the feature units that reflect the common information contents of  $\x_i$ and $\tx_i$.}
 \item[2).] It encourages the feature extractor $f_{\phi}$ to make
$f_{\phi}^u(\x_i) \approx f_{\phi}^u(\tx_i)$  for  those feature units $u$ that $g_{\w}$ relies on heavily, i.e.,   with  large weights $|w_{uy_i}|$.
In other words, {\em it encourages  $f_{\phi}$ to channel 
the common information contents of  $\x_i$ and $\tx_i$ toward the units that $g_{\w}$ considers important}.
\eit
As $\x_i$ and $\tx_i$ share the causal factors for class $y_i$ but not the non-causal factors, those forces help a model focus more on the causal factors.
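These two forces can also be read off from the gradients of the regularizer. The sketch below treats the feature values and the head weights as free variables and writes $\Delta_u = f_{\phi}^{u}(\x_i) - f_{\phi}^u(\tx_i)$, a shorthand introduced here for illustration:
\begin{align*}
	\frac{\partial r_\theta^{\texttt{LAM}}}{\partial w_{uy_i}} &= 2\, w_{uy_i}\, \Delta_u^2, \\
	\frac{\partial r_\theta^{\texttt{LAM}}}{\partial f_{\phi}^{u}(\x_i)} &= 2\, w_{uy_i}^2\, \Delta_u.
\end{align*}
A small gradient step on the first expression shrinks $|w_{uy_i}|$ at a rate proportional to $\Delta_u^2$, so weights on units where the pair disagrees are penalized most. A small gradient step on the second shrinks $|\Delta_u|$ at a rate proportional to $w_{uy_i}^2$, so the features are aligned most strongly on the units that the head relies on.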



\section{Experiments}
\label{experiments}

A direct way to use augmented examples is to add them to the training set and train a model on the combined data using ERM.  We denote this approach as ERM+DA. 
Alternatively, we can pair them up with the original images and apply  CR methods on the resulting SS pairs. 
The main objective of our empirical studies is to compare LAM with ERM+DA, with ERM itself as a baseline.
We also compare LAM with TPM and TLM, as well as previous CR methods.


Another way to utilize the augmented examples is to run a single-source DG algorithm on the combined data.
It is also possible to treat the augmented examples as a separate domain and run a multi-source DG algorithm.  We further compare LAM with representative single-source and multi-source DG methods in those settings. 

Additionally, we assess the impact of the quality and quantity of augmented examples. We consider examples from two DA methods. 
The first one is RandAugment~\citep{cubuk2020randaugment}. It creates augmented examples by applying a random set of 
transformations such as resizing, rotating, and color jittering to original images.
The second method is Targeted DA~\citep{gao2023out}.
It aims to randomize spurious factors while preserving robustly predictive factors.  The specific designs of Targeted DA vary across datasets.  Targeted DA generally yields more informative SS pairs infused with more specific domain knowledge.
We call examples produced from Targeted DA \emph{target-augmented examples}.
% We apply it to a fraction of the training examples on some datasets. 



\input{Table1}

\begin{figure*}[t]
	\begin{center}
		\includegraphics[width=1\textwidth]{figs1/gradcam_new_uai_compressed.png}
	\end{center}
	\vspace{-4mm}
	\caption{Grad-CAM saliency maps for the top predicted class by models trained on ImageNet-9 using various methods. The model learned using LAM focuses on the foreground objects better.}
	\label{fig:feature_map}
	 % \vspace{-1mm}
\end{figure*}



% \vspace{-2mm}
\subsection{Datasets}
\label{sec:datasets}

Our experiments involve five DG datasets, three with background shifts and two with style shifts. 

%\subsubsection{Background Shift} 

\textbf{\iWildCam (iWildCam)}~\citep{beery2020iwildcam,koh2021wilds} consists of camera trap photos of animals taken at different locations for wildlife classification.  The training domain comprises images from 200 locations, while the test and validation domains contain images from other locations. Targeted DA is performed via Copy-Paste: the animals in a training image are pasted onto another image (with no animal) taken at a different location where the same animals sometimes appear~\citep{gao2023out}.


\textbf{ImageNet-9}~\citep{xiao2020noise} includes images of nine coarse-grained classes from ImageNet~\citep{deng2009imagenet}. Several synthetic variations are created by segmenting the foreground of each image and placing it onto a different background.
In our experiments, the synthetic images with a black background are used as target-augmented examples. For the test domain, we use the samples where the foreground of an original image is placed onto the background of a random image.

 
\textbf{NICO}~\citep{he2020towards} includes around 25,000 images across 19 classes of animals or vehicles in different contexts such as ``at home'' or ``on the beach''. As there is no predefined train-test split, we randomly select one context per class for testing and use the remaining contexts for training. Target-augmented training examples and test domains are created in a way similar to ImageNet-9. 


\textbf{\Camelyon (Camelyon)}~\citep{tellez2018whole, koh2021wilds} contains histopathology images from multiple hospitals for binary tumor classification. Images from three hospitals are used for training, while images from two other hospitals are used for testing and validation respectively. There are stylistic variations among images from different hospitals. One key stylistic difference often observed is the stain color. Therefore, the stain color jitter is applied to training images to create {target-augmented examples}~\citep{gao2023out}. 
%The jitter can effectively randomize the average stain level of images. 



\textbf{PACS}~\citep{li2017deeper} contains images of objects and creatures in four different styles: \emph{photo}, \emph{art}, \emph{cartoon} and \emph{sketch}.  Following common practice~\citep{li2017deeper, gulrajani2021in}, we train 
a model using three of the domains and test the model on the fourth domain. 
For Targeted DA, we apply Stable Diffusion~\citep{rombach2022high} to
images in the \emph{photo} domain to create target-augmented examples in the other three domains.
The {\em photo} domain is therefore not used as the test domain, while the other three domains are used as the test domain in turn. 
See Appendix \ref{creation} for details.


For all datasets, RandAugment~\citep{cubuk2020randaugment} is performed on all training examples.
Targeted DA~\citep{gao2023out} is also performed on all training examples in iWildCam and Camelyon.  However, it is performed on only about 5\% of the training data in ImageNet-9 and NICO, and about 10\% of the training data for PACS.

All CR methods have a balancing parameter  $\lambda$, which is tuned on the validation domain for iWildCam and Camelyon, and on a test set from the training domain for the other three datasets.
For CR and single-source methods, multiple training domains are simply combined into one.
More details on how the training data are organized for different types of methods can be found in Table~\ref{tab:data summary} (Appendix~\ref{imple}).

%\vspace{-1mm}
\subsection{Network Architecture and Weight Initialization} 
\label{sec:model}

% In our experiments, we start with pretrained models and finetune them on the aforementioned datasets.
Following \citet{gao2023out}, we use a variety of models for different datasets. Specifically, we use an ImageNet pretrained ResNet-50 model~\citep{he2016deep} for iWildCam, and a randomly initialized
DenseNet-121 model~\citep{huang2017densely} for Camelyon.
We use a CLIP-pretrained ViT-B/16 model~\citep{radford2021learning}  for ImageNet-9 and NICO, and a CLIP-pretrained ResNet-50 model for PACS.   

{To show that combining LAM with an advanced CLIP fine-tuning method can yield SOTA-level performance on iWildCam, we also employ CLIP-pretrained ViT-L/14 and ViT-L/14@336 models for this dataset.}

The use of various model architectures and weight initializations allows us to assess the relative merits of DG algorithms on a mixture of datasets and models.
Implementation details about hyperparameters for each dataset and method can also be found in Appendix \ref{imple}.


\input{Table2}


%\vspace{-1mm}
\subsection{Comparison with ERM+DA}
\vspace{-1mm}
Table \ref{tab:main} shows the results for LAM, ERM+DA, and ERM.  We see that simply adding augmented data to the training set (ERM+DA) increases the average OOD score from 71.4\% to 76.5\% with RandAugment~\citep{cubuk2020randaugment}, and to 78.7\% with Targeted DA~\citep{gao2023out}.  {\em Applying LAM on the resulting SS pairs further increases the scores
to 78.2\% and 81.1\% in the two cases respectively.  }
 In the case of Targeted DA, the average OOD score on the five benchmarks is improved by $81.1 - 71.4 = 9.7$\%, with $78.7 - 71.4 = 7.3$\% due to data augmentation and $81.1 - 78.7 = 2.4$\% due to LAM. 
The improvements are especially pronounced on the iWildCam and Camelyon datasets, where Targeted DA increases the OOD scores drastically.
This is consistent with what was reported in~\citet{gao2023out}.
LAM further improves the scores by 4.7\% and 3.0\% respectively.



While trying to gain some insights, we find that LAM makes a model focus on far fewer feature units (see  Figure \ref{fig:weight distribution}
in Appendix
\ref{vis}) than ERM+DA does. We also use an XAI method called Grad-CAM~\citep{selvaraju2017grad} to explain the outputs of the models trained on ImageNet-9 by LAM and by some other methods.  Examples are shown in 
Figure \ref{fig:feature_map}  (and 
Figure \ref{fig:more_saliency_map}
in Appendix
\ref{vis}).   We see that, in all those examples, the LAM model focuses on the foreground objects and gives the correct predictions.  These observations corroborate the analysis we make at the end of Section 
\ref{sec:CR-labeled}.
In contrast, the ERM+DA model is more inclined to focus 
on the wrong part of an input image and predict incorrectly. 



{So far we have compared LAM with the traditional ERM, which is based on the standard cross-entropy loss. \citet{goyal2023finetune} show that, when fine-tuning CLIP models, it is more effective to use the CLIP contrastive loss together with the CLIP text encoder. Their method is colloquially known as ``finetune like you pretrain'' (FLYP). In Table \ref{tab:flyp}, we show that LAM also improves OOD performance over FLYP+DA.} 

%FLYP+DA+LAM also outperforms the methods that ensemles , Model Soup \citep{wortsman2022model} and WiSE-FT \citep{wortsman2022robust}.    }

%\vspace{-7.5mm}
\subsection{Impact of Quality and Quantity of Augmented Examples}
\label{impact_qq}
%\vspace{-0.5mm}
Both LAM and ERM+DA achieve better results with Targeted DA than with RandAugment.  We believe this is because 
Targeted DA generally yields higher-quality augmentations than RandAugment.  To further support this claim, we perform additional experiments with ImageNet-9 in the Targeted DA setting.  Specifically, we test three different ways to create augmented examples: 
1).  use a segmentation method called  GrabCut~\citep{rother2004grabcut}, 2). use another less effective segmentation method called FCN~\citep{long2015fully}, and 3). simply use bounding boxes that come with ImageNet-9 (Box).
% which are inferior to FCN.
The resulting OOD scores are as follows: 

\vspace{-1mm}
\bc
\setlength{\tabcolsep}{5pt}
\begin{tabular}{cccccc}
%			\toprule
%		 \multicolumn{7}{c}{\scriptsize{Tab. 3-1: \emph{OOD performance}  vs. \emph{pair quality}}}\\
\toprule
	 \multicolumn{2}{c}{{Box}} &  \multicolumn{2}{c}{{FCN}} &  \multicolumn{2}{c}{{GrabCut}} \\
\cmidrule(lr){1-2} \cmidrule(lr){3-4} \cmidrule(lr){5-6}
 	\small{ERM+DA} & \small{LAM}  &  	\small{ERM+DA} & \small{LAM} &  	\small{ERM+DA}  & \small{LAM}\\
\midrule
85.2 & 85.9  & 83.9& 86.6& 86.0& 88.1\\
\bottomrule
\end{tabular}
\ec


%\vspace{-2mm}

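We see that, as expected, the results with GrabCut are the best for both methods. For LAM, they are followed by those with FCN and Box, in that order, whereas ERM+DA scores lower with FCN (83.9) than with Box (85.2). 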

We also perform additional experiments with ImageNet-9 to investigate how the quantity of augmented examples influences LAM.  Specifically, GrabCut is applied to different percentages of the training examples and LAM is run on the resulting SS pairs. For comparison, we do the same for ERM+DA.
The resulting OOD scores are as follows:

%\vspace{-2mm}
\bc
\begin{tabular}{cccccc}
%				\toprule
%			\multicolumn{7}{c}%{\scriptsize{Tab. 3-2: \emph{OOD performance}  vs. \emph{\# pairs} for LAM}} \\
\toprule	
  &5\% & 10\% &20\% & 50\% &100\%  \\
\midrule				
ERM+DA  &86.0  &86.9 &86.1 &87.4 &87.8 \\
LAM   &88.1  &88.5 &88.6 &89.7 &90.4 \\
\bottomrule	
\end{tabular}
\ec

%\vspace{-2mm}

Increasing the quantity of SS pairs clearly benefits LAM: its OOD score rises from 88.1\% with SS pairs for 5\% of the training examples to 90.4\% with SS pairs for all of them. The availability of SS pairs for a small fraction of training examples can already improve OOD performance significantly. Providing more SS pairs also improves the performance of ERM+DA (from 86.0\% to 87.8\%), but the improvement is smaller than that of LAM.

\input{Table3}

\input{Table4}


\subsection{Comparison with Other CR Methods}

Table \ref{tab:CR_comparison} shows the results for LAM and other CR methods. 
Let us first compare LAM and two other CR methods we propose in this paper, namely target probability matching (TPM) and target logit matching (TLM).
We see that LAM achieves higher OOD scores than the other two methods on average, and it outperforms ERM+DA in all cases while the other two methods do not.  These results show that {\em when making use of labeled SS pairs, it is more effective to apply consistency regularization to the logit contributions of the target classes (LAM) than to the logits themselves (TLM) or the probabilities of the target classes (TPM).  }

Next, we compare LAM with strong previous CR methods, namely probability matching with KL or JS, logit matching (LM) and feature matching (FM).
LAM achieves higher OOD scores than those methods on average, and it achieves the highest score in all cases except for Camelyon with Targeted DA.  Moreover, 
it outperforms ERM+DA in all cases, while the other methods do not.
These results show that {\em it is generally beneficial to regard SS pairs created using both Targeted DA and RandAugment as labeled and apply LAM on them, rather than considering them unlabeled and applying any of the previous CR methods on them}. 

In LAM, a labeled SS pair $(\x_i, \tx_i; y_i)$ is used only to regularize the contributions from feature units to the logit of {the ground-truth class $y_i$}. It does not impact the other classes. In the previous CR methods, on the other hand, the pair is used to regularize the entire feature, logit, or probability vector for $\x_i$. It affects other classes as well as $y_i$. This is problematic when a training example $\x_i$   contains multiple objects of interest. Some objects that appear in the background of the main object in $\x_i$ might be removed during data augmentation. In such a case, the features of those minor objects would be suppressed.
To further demonstrate the adverse consequences, we created a variant of the iWildCam dataset~\citep{beery2020iwildcam, koh2021wilds} by adding a small segmented image of another animal to the background of each image. The new dataset is named \textbf{iWildCam-N} (examples of this dataset are given in Appendix \ref{more_dataset}).  On this dataset, LAM still improves over ERM+DA. However, the performances of all four previous methods are substantially worse than that of ERM+DA. 

Camelyon is a binary classification problem. There is no issue of suppressing features of other classes.  This is probably why probability matching with JS is superior to LAM on  Camelyon in the case of Targeted DA.

% \vspace{-1mm}
\subsection{Comparison with Other DG Methods}

Table \ref{tab:results_on_shift} compares the OOD performance of LAM with that of six representative single-source and multi-source DG methods reviewed in Section 
\ref{sec:related work}.
Here only Targeted DA~\citep{gao2023out} is considered. 
On the first four datasets, LAM outperforms all six DG methods on average. In particular, it outperforms them by large margins on iWildCam.
While LAM improves over ERM+DA on all the first four datasets, the other methods are inferior to ERM+DA in the majority of the cases.
On the binary classification dataset Camelyon, however, LAM is on par with SD~\citep{pezeshki2021gradient}, but it still outperforms ERM+DA.


Recall that augmented examples are simply added to the training set
 for the single-source methods (RSC~\citep{huang2020self} and SD), and they are treated
 as an additional training domain for the multi-source methods (DANN~\citep{ganin2016domain}, GDRO~\citep{sagawa2019distributionally}, IRM~\citep{arjovsky2019invariant} and V-REx~\citep{krueger2020outofdistribution}).  In contrast, LAM applies consistency regularization 
 on the resulting SS pairs.   
 The results in Table \ref{tab:results_on_shift} show that {\em consistency regularization with LAM is a more effective way to use augmented examples than representative previous single-source and multi-source DG methods.}
 
%Figures \ref{fig:feature_map}  and
%Figure \ref{fig:more_saliency_map}



\section{Conclusion}
In this paper, we study the setting where a training domain is associated with a collection of example pairs that share the same semantic information.  We present a theory to motivate using such semantic sharing (SS) pairs to boost model robustness under domain shift.
% One way to create SS pairs is to use a data augmentation method.
We find that applying consistency regularization (CR) on the SS pairs, particularly using LAM, significantly improves OOD performance compared to simply adding the augmented examples to the training set. An interesting future direction is to develop more efficient methods for creating more informative SS pairs, e.g., by leveraging advances in generative models. We hope our work could encourage more efforts in manually creating SS pairs for domain generalization, similar to the collection of human preference pairs for LLM alignment.

\section*{Acknowledgement}

We thank the deep learning computing framework MindSpore
(\href{https://www.mindspore.cn}{https://www.mindspore.cn}) and its team for the support on
this work. Research on this paper was supported in part by
Hong Kong Research Grants Council under grant 16204920.
Kaican Li and Weiyan Xie were supported in part by the Huawei PhD Fellowship Scheme. 

\newpage

% References
\bibliography{main}



\onecolumn



%This Supplementary Material should be submitted together with the main paper.

\appendix
% \title{Appendices}
% \maketitle



% \vspace{-3cm}
\input{appendix}

 \end{document}