\section{Proofs}
\label{proof}

{\bf Proof of Theorem~\ref{theo.ID-optimal}}:
Let us start with the ID cross-entropy loss:
\begin{equation*}
    \begin{split}
        \ell_{\mathrm{s}}(\hat{P}_{\theta})  &= \mE_{(\x, y) \sim P^\mathrm{s}} [- \log \hat{P}_{\theta}(\hat{Y}=y|\x) ] \\
	&=-\mE_{(\xc, \xn) \sim P^\mathrm{s}}\mE_{\x \sim P^*(\x|\xc, \xn)}\mE_{y \sim P^*(y|\xc)} [ \log \hat{P}_{\theta}(\hat{Y}=y|\x) ].
    \end{split}
\end{equation*}
 %&=&-\mE_{\xc \sim P^\mathrm{s}(\Xc)} \mE_{y \sim  P^{*}(Y|\xc)} \mE_{\xn \sim P^\mathrm{s}(\Xn|\xc)}\mE_{ \x \sim P^*(\X|\xc, \xn)} [ \log \hat{P}_{\theta}(\hat{Y}=y|\x) ].  
Because $\hat{P}_{\theta}$ is causally invariant, $ \hat{P}_{\theta}(\hat{Y}=y|\x) $ depends only on $\xc$, but not $\xn$. Denote it as 
$Q_{\theta}(\hat{Y}=y|\xc)$. Then, we get
\begin{equation*}
    \begin{split}
        \ell_{\mathrm{s}}(\hat{P}_{\theta})  &= 
 -\mE_{\xc \sim P^\mathrm{s}} \mE_{y \sim  P^{*}(y|\xc)} \mE_{\xn \sim P^\mathrm{s}(\xn|\xc)}\mE_{ \x \sim P^*(\x|\xc, \xn)}
 [\log Q_{\theta}(\hat{Y}=y|\xc)]\\
   &= - \mE_{\xc \sim P^\mathrm{s}}
	\mE_{y \sim  P^{*}(y|\xc)}[\log Q_{\theta}(\hat{Y}=y|\xc)].
    \end{split}
\end{equation*}
As the ID loss $\ell_{\mathrm{s}}(\hat{P}_{\theta})$ is minimized,
the inner expectation is maximized for any $\xc$ such that  $P^\mathrm{s}(\xc)>0$.  
%By Gibbs' inequality, this implies that
%$$Q_{\theta}(\hat{Y}=y|\xc)=P^*({Y}=y|\xc)$$
%for any $\xc$ such that  $P^\mathrm{s}(\xc)>0$ and  any    value %$y$ of $Y$ and $\hat{Y}$.  
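To spell out this step, fix any $\xc$ with $P^\mathrm{s}(\xc)>0$. The following is the standard Gibbs' inequality argument, stated under two assumptions that we make explicit here: $Y$ takes finitely many values, and the hypothesis class is expressive enough for the bound to be attained at every such $\xc$ simultaneously. Gibbs' inequality gives
\begin{equation*}
    \mE_{y \sim P^{*}(y|\xc)}[\log Q_{\theta}(\hat{Y}=y|\xc)] \leq \mE_{y \sim P^{*}(y|\xc)}[\log P^{*}(Y=y|\xc)],
\end{equation*}
with equality if and only if $Q_{\theta}(\hat{Y}=y|\xc) = P^{*}(Y=y|\xc)$ for every value $y$. Under the assumptions above, a minimizer of the ID loss therefore attains this upper bound for every $\xc$ such that $P^\mathrm{s}(\xc)>0$.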

Now, consider the  OOD cross-entropy loss $\ell_{\mathrm{t}}(\hat{P}_{\theta})$
of the target domain $P^\mathrm{t}$.  By symmetry, we have:
\begin{equation*}
	\ell_{\mathrm{t}}(\hat{P}_{\theta}) = - \mE_{\xc \sim P^\mathrm{t}}
	\mE_{y \sim  P^{*}(y|\xc)}[\log Q_{\theta}(\hat{Y}=y|\xc)].
\end{equation*}
%
We know from above that the inner expectation is maximized for all $\xc$ such that $P^\mathrm{s}(\xc)>0$.
Because $\supp[P^\mathrm{t}(\Xc)] \subseteq \supp[P^\mathrm{s}(\Xc)]$, it is also maximized for every
$\xc$ such that $P^\mathrm{t}(\xc)>0$.
Therefore, the OOD loss $\ell_{\mathrm{t}}(\hat{P}_{\theta})$ is minimized as well.
\hfill$\square$ \\


\section{Related Theoretical Results}
\label{theory-related}
%{\color{blue}  [Kaican]
%\citet{peters2016causal} proposed invariant causal prediction (ICP) as a method for causal inference. ICP posits the existence of some input variables $X^\ast$ such that there is an invariant process to derive $Y$ from $X^\ast$. ICP relies on the diversity of observed domains to infer $X^\ast$ and the invariant process. Given sufficiently diverse domains, it is shown that the inference can be done with high confidence.
% Furthermore, under certain conditions, it is even possible to infer the direct causal parents of $Y$. The $\Xc$ in our framework resembles $X^\ast$ in the sense that they are both the cause of $Y$. However, they are also quite different since $X^\ast$ is always observed whereas $\Xc$ is not. Our goal is to recover $\Xc$ which might be hidden in $X$, and thus more challenging than ICP. In contrast, ICP focuses on the invariance of prediction with respect to observed variables and does not address the problem of hidden causal variables.

% Apart from that, our framework focuses on how domain generalization may be achieved thereafter while \citet{peters2016causal} focused on the sufficient conditions to identify $X^\ast$.
% For this reason, we turn to
% For this reason, we require a stronger condition, namely causal invariance 
% In addition, we do not assume access to data with domain labels.
% Instead, 

%\citet{arjovsky2020out} addresses the limitation of ICP by lifting the assumption of observable $X^\ast$. It is shown that given a data representation $\Phi(X)$, the test-domain error is bounded by the training-domain error and the maximum divergence between the conditional label distributions $P(Y|\Phi(X))$ of two domains among all domains (including the test domains).
%Apparently, a causal representation $\Phi(X) = \Xc$ not only minimizes this divergence but also leads to causally invariant prediction because $P(Y|\Xc)$ is invariant.\citet{arjovsky2020out} aims to learn $\Phi(X) = \Xc$ while we aim to learn $\hat{Y} = Y$.

% The main difference is that \citet{arjovsky2020out} focuses on learning a good representation $\Phi(X)$ from multiple training domains while we focus on learning a good predictor.

%Results from causal representation learning.
%}




%[Nevin's version]=== \\
%	The concept of {\em causally invariant prediction (CIP)}, as defined in Section \ref{sec:causal-model},  should not be confused with a notion described in \citep{peters2016causal} that has a very similar name, {\em invariant causal prediction (ICP)}. In fact, Peters et al.\ are concerned with causal discovery. The problem is to determine, among a set of {\em observed predictor variables} $\{X_1, X_2, \ldots, X_p\}$, the causes of a target variable $Y$.  To solve the problem, they propose to learn {\em multiple models} with different subsets of the predictor variables across multiple domains, and conclude the subset of variables that lead to invariant prediction accuracy across all the domains are the causes of $Y$.In this paper, our objective is to {\em learn a single prediction model} that depends only on the {\em latent variable} $\Xc$ so that it can generalize well to new domains.

The concept of {\em causally invariant prediction (CIP)} that we introduce in Section \ref{sec:causal-model} is closely related to a notion described in \citet{peters2016causal} that bears a very similar name --- {\em invariant causal prediction (ICP)}. 
There is a subtle difference, however. Causally invariant prediction refers to the situation where a model makes predictions based on causal factors and, consequently, its performance remains invariant across domains.   
On the other hand, invariant causal prediction refers to the situation where a model's performance remains invariant across domains and, consequently, its input variables can be considered as causes for the output variable. 
CIP is for domain generalization while ICP is for causal discovery.
In addition, our work involves latent variables ($\Xc$ and $\Xn$), while \citet{peters2016causal} deal with only observed variables.
 
 Our Theorem \ref{theo.ID-optimal} is closely related to Theorem 1 of \citet{mahajan2021domain} and Theorem 3.2 of \citet{arjovsky2020out}.
 However, the causal model used by \citet{mahajan2021domain} has three more latent variables than the one we use.  In fact, our model can be viewed as their model with the additional latent variables ``integrated out''. As such, our theorem targets a more general setting.
 In addition, their theorem focuses exclusively on feature matching and hence cannot be used to motivate logit attribution matching (LAM).
 The theorem of \citet{arjovsky2020out} also focuses on the feature extractor. It requires examples with the same feature representation to have approximately the same output probability distributions under the generative model. In this sense, it seeks to obtain features with invariant prediction by the {\em generative model}.  In contrast, our theorem requires a {\em prediction model} to be invariant to the non-causal factors. While the theorem of \citet{arjovsky2020out} is used to motivate a DG algorithm called invariant risk minimization (IRM), our theorem is used to justify consistency regularization.
 

In this paper, we use a causal theory of domain generalization to motivate consistency regularization methods. It should be noted that there are other theories for domain generalization that are based on divergence between domains~\citep{ben2010theory,liu2020towards}. Those theories are used to motivate 
the domain invariant representation approach to domain generalization. However, they cannot be used to justify consistency regularization methods. 


 


 


 
\section{More details of SS pair creation using Targeted DA}
\label{creation}


An SS pair is formed by a training example and an augmented example. The SS pair creation using Targeted DA for each dataset has been introduced in Section \ref{sec:datasets}. We provide more details and examples here.

\subsection{iWildCam and iWildCam-N}

For iWildCam and iWildCam-N, we utilized a Targeted DA technique named Copy-Paste (same-y) from \citet{gao2023out}. This DA method pastes the animal foreground onto a background image sampled from the same habitat where the same animal species has been observed. There is a category of images labeled ``empty'' in the iWildCam dataset. These images do not contain any animals and were used as background images when creating augmented examples. We used the segmentation for the animal foregrounds provided by \citet{beery2021iwildcam} to apply this DA. Augmented examples produced by this DA approach are provided in Figure \ref{fig:iwildcam_pairs}.

% The Copy-Paste (same-y) DA treats the animal foregrounds and the high-level habitat features in the background as shared semantic contents, while introducing randomness to the low-level background features. 


\begin{figure*}[ht]
	\begin{center}
		\includegraphics[width=11cm]{figs1/iwild_example.png}
	\end{center}
	\caption{SS pairs created via Copy-Paste (same-y) DA for iWildCam. This DA method involves pasting the animal onto another image without animals sampled from the location where the same animal species has been observed.}
	\label{fig:iwildcam_pairs}
\end{figure*}


\subsection{ImageNet-9}

In our main experiments, synthetic images with a black background were used as augmented data for ImageNet-9. Those augmented examples were created based on GrabCut segmentation. As described in Section \ref{impact_qq}, to assess the performance of LAM with augmented examples of varying quality, we also considered augmented examples created from bounding boxes and from semantic segmentation. Specifically, we used the bounding boxes provided by ImageNet \citep{deng2009imagenet} and the semantic segmentation produced by FCN~\citep{long2015fully}. Augmented examples of varying quality are shown in Figure \ref{fig:pair quality}. %Since the creation of augmented examples for NICO is similar to that for the ImageNet-9, we do not elaborate on it further.

\begin{figure*}[ht]
	\begin{center}
		\includegraphics[width=11cm]{figs1/pair_quality_v3.png}
	\end{center}
	\caption{Augmented examples of varying quality created for ImageNet-9.}
	\label{fig:pair quality}
\end{figure*}

\subsection{NICO}
To create the augmented examples for NICO, we pasted the foreground segmentation onto the background of a random image. We used GrabCut \citep{rother2004grabcut} to identify the foreground segmentation for 20 images in each class of NICO, which constituted about 5\% of its training data. On average, the segmentation of an image took us around three seconds.

Since NICO does not have ``empty'' background images like iWildCam, we had to create synthetic background images. To do this, we removed the foreground of an image by coloring the region corresponding to the foreground segmentation black. We created synthetic background images for all images with a foreground segmentation. To create an augmented example, we pasted the foreground segmentation of the training example onto a randomly selected synthetic background image. See Figure \ref{fig:nico_pairs} for some NICO augmented examples.
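
The compositing described above can be written compactly. The notation below is introduced only for illustration and is not used elsewhere in the paper: $\mathbf{m} \in \{0,1\}^{H \times W}$ denotes the binary foreground mask of a training image $\x$ (broadcast over color channels), $\x'$ is another segmented image of the same size with mask $\mathbf{m}'$, and $\odot$ is the elementwise product. Under these assumptions, the synthetic background $\mathbf{b}'$ and the augmented example $\tx$ are
\begin{equation*}
    \begin{split}
        \mathbf{b}' &= (\mathbf{1} - \mathbf{m}') \odot \x', \\
        \tx &= \mathbf{m} \odot \x + (\mathbf{1} - \mathbf{m}) \odot \mathbf{b}'.
    \end{split}
\end{equation*}
Pixels that belong to the removed foreground of $\x'$ but are not covered by $\mathbf{m}$ remain black in $\tx$.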


%we used the strategy proposed by \citet{xiao2020noise}. Specifically, we first created the ``tiled'' version of the background by finding the largest rectangular strip (horizontal or vertical) outside the foreground bounding box, and tiling the entire image with that strip. An example of the tiled background image is in Figure \ref{bg_create} (b). The final background image (Figure \ref{bg_create} (c)) was created by replacing the foreground bounding box with the same region in the tiled background image. 







\begin{figure*}[ht]
	\begin{center}
		\includegraphics[width=11cm]{figs1/nico_background_examples.png}
	\end{center}
	\caption{{SS pairs created for NICO by placing the foreground segmentation onto a randomly selected synthetic background image.}}
	\label{fig:nico_pairs}
\end{figure*}


\begin{figure*}[ht]
	\begin{center}
		\includegraphics[width=11cm]{figs1/cam_example.png}
	\end{center}
	\caption{SS pairs created by stain color jitter for the Camelyon dataset. This DA randomizes the average stain level in the image.}
	\label{fig:cam_pairs}
\end{figure*}

\begin{figure*}[ht]
	\begin{center}
		\includegraphics[width=11cm]{figs1/style_shift_example_v3.png}
	\end{center}
	\caption{SS pairs created via StableDiffusion, which generates augmented examples from the training examples of the {\em photo} domain in the PACS dataset. The prompt we use is ``a minimalist drawing of a \texttt{class\_name}, outline only, no texture'' where \texttt{class\_name} is the name of the true class label.}
	\label{fig:style-shift pairs}
\end{figure*}


\subsection{Camelyon}

For the Camelyon dataset, we followed \citet{gao2023out} and used stain color jitter \citep{tellez2018whole} as the Targeted DA to create the augmented examples. This technique transforms images by jittering their color in the hematoxylin and eosin staining color space. This DA addresses the style shift associated with the stain color resulting from diverse staining techniques used across different hospitals. It randomizes the average stain level in each image while maintaining all other information as predictive features. Some augmented examples are shown in Figure \ref{fig:cam_pairs}.
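
As a rough sketch of what such a jitter looks like, the following reflects our reading of the stain-space augmentation of \citet{tellez2018whole}. The symbols $c_i$, $\alpha_i$, $\beta_i$, and $\sigma$ are illustrative, and the exact parameterization in our experiments is the one of \citet{gao2023out}. The image is first mapped from RGB to stain channels $c_i$ by color deconvolution, each channel is perturbed independently, and the result is mapped back to RGB:
\begin{equation*}
    c_i' = \alpha_i c_i + \beta_i, \qquad \alpha_i \sim \mathcal{U}(1-\sigma,\, 1+\sigma), \qquad \beta_i \sim \mathcal{U}(-\sigma,\, \sigma).
\end{equation*}
Here $\sigma$ controls the jitter strength. Because $\alpha_i$ and $\beta_i$ are sampled once per image and channel, the perturbation shifts the overall stain level while leaving the spatial structure of the tissue unchanged.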




\subsection{PACS}

To create SS pairs for PACS, we used StableDiffusion~v2~\citep{rombach2022high} to translate images from the {\em photo} domain of PACS into a different style. Given a training example $\x$ of label $y$, we added a mild level of Gaussian noise to the latent representation of $\x$, and then removed the noise under the guidance of a text prompt.
The prompt we used is ``a minimalist drawing of a \texttt{class\_name}, outline only, no texture'' where \texttt{class\_name} is the name of $y$.
We chose this prompt because it produced the best visual quality among the prompts we explored.
Finally, we decoded the generated noise-free latent representation, producing the corresponding augmented example $\tx$.
See Figure \ref{fig:style-shift pairs} for some examples.
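
The noising step can be sketched in the usual latent-diffusion notation. This is stated as an assumption about the standard formulation rather than a description of our exact sampler configuration, and the symbols $z_0$, $t_0$, $\bar{\alpha}_{t}$, and $\epsilon$ are introduced only for illustration. With $z_0$ the encoded latent representation of $\x$, the noisy latent at an intermediate timestep $t_0$ is
\begin{equation*}
    z_{t_0} = \sqrt{\bar{\alpha}_{t_0}}\, z_0 + \sqrt{1-\bar{\alpha}_{t_0}}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\end{equation*}
where $\bar{\alpha}_{t_0}$ is the cumulative noise-schedule coefficient. A mild noise level corresponds to a small $t_0$, so that $z_{t_0}$ retains the coarse layout of $\x$. Running the prompt-conditioned reverse process from $t_0$ back to $0$ yields the noise-free latent that is decoded into $\tx$.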





% \subsection{Implementation summary of different datasets}
% \label{sec:data summary}
% In \ref{tab:data summary}, we summarize setting of each dataset in model training, which includes how many training examples used to create augmented pairs and corresponding methods.


% \begin{table*}[ht]
% 	\small
% 	\begin{center}
% 		\caption{Dataset details of shifts, pair quantity and methods to create DA examples. We use ``DA'' as a shorthand for ``augmentation''.}
		
% 		\
		
% 		\label{tab:data summary}
		
% 		\begin{tabular}{ccll}
% 			\toprule
% 			\multirow{2}{*}{\textbf{Dataset}} & \multirow{2}{*}{\textbf{Shift}} & \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}\textbf{Pair quantity}\onedot \\ (\% of training examples)\end{tabular}} & \multirow{2}{*}{\textbf{Method to create DA examples}} \\
% 			&  & &   \\ \midrule
% 			\multirow{2}{*}{ImageNet-9} & \multirow{2}{*}{Background} & \multicolumn{1}{c}{\multirow{2}{*}{\begin{tabular}[c]{c} 5\% \end{tabular}}} &  \multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}Only preserve foreground objects,\\ remove background as black\end{tabular}}  \\
% 			&  & &   \\ \midrule
% 			\multirow{3}{*}{NICO} & \multirow{3}{*}{Background} & \multicolumn{1}{c}{\multirow{3}{*}{\begin{tabular}[c]{c} 5\% \end{tabular}}} & \multirow{3}{*}{\begin{tabular}[c]{@{}l@{}}Only preserve foreground objects,\\ background replaced with one \\sampled from other images\end{tabular}}\\
% 			&  & &   \\
% 			&  & &   \\ \midrule
% 			\multirow{4}{*}{\begin{tabular}[c]{@{}c@{}}iWildCam\end{tabular}} & \multirow{4}{*}{\begin{tabular}[c]{@{}c@{}}Background\end{tabular}} & \multicolumn{1}{c}{\multirow{4}{*}{\begin{tabular}[c]{@{}c@{}}All images \\with animals\end{tabular}}} &  \multirow{4}{*}{\begin{tabular}[c]{@{}l@{}}The animals are cut-and-paste to \\another image without animals taken \\at a different location where the same \\animals sometimes appear\end{tabular}} \\ 
% 			&  & &   \\ 
% 			&  & &   \\ 
% 			&  & &   \\ \midrule
% 			\multirow{4}{*}{\begin{tabular}[c]{@{}c@{}}iWildCam-N\end{tabular}} & \multirow{4}{*}{\begin{tabular}[c]{@{}c@{}}Background\end{tabular}} & \multicolumn{1}{c}{\multirow{4}{*}{\begin{tabular}[c]{@{}c@{}} All images \\with animals \end{tabular}}} &  \multirow{4}{*}{\begin{tabular}[c]{@{}l@{}}The animals are cut-and-paste to \\another image without animals taken \\at a different location where the same \\animals sometimes appear\end{tabular}} \\ 
% 			&  & &   \\ 
% 			&  & &   \\ 
% 			&  & &   \\ \midrule
% 			\multirow{4}{*}{PACS} & \multirow{4}{*}{\begin{tabular}[c]{@{}l@{}}Style\end{tabular}} & \multirow{4}{*}{\begin{tabular}[c]{@{}l@{}}100 samples for each\\ class in Photo domain (P) \end{tabular}} & \multicolumn{1}{c}{\multirow{4}{*}{\begin{tabular}[c]{@{}l@{}}Employ StableDiffusion to transform the \\image style using the text prompt ``a \\minimalist drawing of a {\tt class\_name}, \\outline only, no texture'' \end{tabular}}} \\
% 			&  & &  \\
% 			&  & &   \\
% 			&  & &   \\ \midrule
% 			Camelyon & Style & \multicolumn{1}{c}{100\%} & Use augmentation of stain color jitter \\
% 			\bottomrule
% 		\end{tabular}
% 	\end{center}
% \end{table*}

\newpage

\section{Details of iWildCam-N dataset}
\label{more_dataset}

The \textbf{iWildCam-N} dataset is an altered version of the iWildCam dataset~\citep{beery2020iwildcam, koh2021wilds}, which includes extra background noise in addition to the original background shift in iWildCam. This additional noise was created by pasting an animal foreground of a different species, sampled from a randomly selected image, onto the background of the image. To ensure that the main semantic context of the image is not distorted by the introduced noise, we limited the size of the introduced animal to be smaller than the pre-existing animal foreground and took steps to prevent overlap between the newly added animal and the original animal foreground. We applied this operation to all images in the iWildCam dataset except those in the ``empty'' category, which do not contain any animals. The ``empty'' category was also excluded from the iWildCam-N dataset.
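
The construction and its two constraints can be summarized as follows. The notation is introduced only for illustration: $\mathbf{m}$ is the binary mask of the original animal foreground in an image $\x$, $\mathbf{a}$ is the resized and repositioned image of the inserted animal with binary mask $\mathbf{m}_a$, and $|\cdot|$ counts the nonzero entries of a mask. Measuring size by pixel count is an assumption of this sketch. The noisy image $\x_{\mathrm{N}}$ is then
\begin{equation*}
    \x_{\mathrm{N}} = \mathbf{m}_a \odot \mathbf{a} + (\mathbf{1} - \mathbf{m}_a) \odot \x,
    \qquad \text{with } |\mathbf{m}_a| < |\mathbf{m}| \text{ and } \mathbf{m}_a \odot \mathbf{m} = \mathbf{0}.
\end{equation*}
The first condition keeps the inserted animal smaller than the original one, and the second expresses the no-overlap requirement in idealized form.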

In Figure \ref{iwildcam-n}, we provide some examples from iWildCam-N together with their original images in iWildCam to illustrate the background noise introduced in iWildCam-N.

\begin{figure*}[ht]
	\centering
	\begin{tabular}{cc|cc}
		{{iWildCam}}  & {iWildCam-N} & {{iWildCam}}  & {iWildCam-N}	\\
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/a1.jpg} & 
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/a2.jpg} &
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/b1.jpg} & 
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/b2.jpg} \\
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/c1.jpg} & 
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/c2.jpg} &
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/d1.jpg} & 
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/d2.jpg} \\
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/e1.jpg} & 
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/e2.jpg} &
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/f1.jpg} & 
		\includegraphics[height=2.7cm,width=2.7cm]{figs1/in/f2.jpg} \\
		
	\end{tabular}
	\caption{Sample images in iWildCam-N. The background noise is created by adding other small animals to the background of each image.}
	\label{iwildcam-n}
\end{figure*}


\newpage

\section{Additional Implementation Details}
\label{imple}

The use of augmented examples in the different methods, including ERM+DA, the CR-based DG methods, and the other multi-source and single-source methods, was introduced in Section \ref{experiments}.  We provide a summary in Table \ref{tab:data summary}.

\begin{table*}[ht]
        \vspace{2mm}
	\small
	\begin{center}
		\caption{The use of training data in different methods.}
		
		\
		
		\label{tab:data summary}
		
		\begin{tabular}{ccll}
			\toprule
			\textbf{Category} & \textbf{Methods} & \multicolumn{1}{c}{\textbf{Training data}} & \multicolumn{1}{c}{\textbf{Remark}} \\ \midrule
			Baseline & ERM & training examples &  \multicolumn{1}{c}{-}  \\ \midrule
			\multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}ERM+DA \& \\ Single-source\end{tabular}} & \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}ERM+DA\\RSC, SD\end{tabular}} & \multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}training examples\\+ aug. examples\end{tabular}} & \multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}As additional training data, augmented examples are\\ combined with training examples to train the model.\end{tabular}}\\
			&  & &   \\ \midrule
			\multirow{3}{*}{\begin{tabular}[c]{@{}c@{}}CR-based\end{tabular}} & \multirow{3}{*}{\begin{tabular}[c]{@{}c@{}}LAM, KL, JS,\\LM, FM\\TLM, TPM\end{tabular}} & \multirow{3}{*}{\begin{tabular}[c]{@{}l@{}}training examples\\+ aug. examples\end{tabular}} &  \multirow{3}{*}{\begin{tabular}[c]{@{}l@{}}The training examples are paired with\\ augmented examples to train the model.\end{tabular}} \\ 
			&  & &   \\
			&  & &   \\ \midrule
			\multirow{2}{*}{Multi-source} & DANN, GDRO & \multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}$d_1$: training examples\\ $d_2$: aug. examples\end{tabular}} &\multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}Training examples are regarded as one domain; \\augmented examples form another domain. \end{tabular}} \\
			& IRM, VREx & &   \\
			\bottomrule
		\end{tabular}
	\end{center}
\end{table*}



%

All experiments were conducted on a single NVIDIA V100 GPU. For ImageNet-9, NICO, and PACS, we used the two-step training strategy of linear probing followed by full finetuning (LP-FT) \citep{kumar2022fine}, while for the other datasets we used standard finetuning.
% During LP, we used a learning rate of 0.003. During FT, we used a learning rate of 3e-5 for ImageNet-9, NICO, PACS, and 3.49e-5 for iWildCam, 3.07e-3 for Camelyon. During LP,  models were trained for 10 epochs. For FT, models were trained 20 epochs for ImageNet-9, NICO, iWildCam, 40 epochs for PACS, and 10 epochs for Camelyon.
The hyperparameter settings are summarized in Table \ref{tab:hyperparameter setting}.


\begin{table*}[ht]
    \vspace{2mm}
    \centering
    \caption{Hyperparameter setting for all the main experiments. SS pair transformation refers to the transformation applied to training examples and corresponding augmented examples while training. For other DG methods, we use the default hyperparameters provided by DomainBed~\citep{gulrajani2021in} as the initial values, followed by a hyperparameter tuning process. ``bs'' stands for batch size. }
    \label{tab:hyperparameter setting}
		\vspace{2mm}
		
		
		\small
		\begin{tabular}{|c|ccccc|}
			\hline
			Dataset & \multicolumn{2}{c|}{ImageNet-9 \& NICO} & \multicolumn{1}{c|}{PACS} & \multicolumn{1}{c|}{iWildCam} & Camelyon \\ \hline
			Model & \multicolumn{2}{c|}{CLIP ViT-B/16} & \multicolumn{1}{c|}{CLIP ResNet-50} & \multicolumn{1}{c|}{ResNet-50} & DenseNet-121 \\ \hline
			Pretrained & \multicolumn{4}{c|}{ImageNet pretrained} & False \\ \hline
			Image Size & \multicolumn{3}{c|}{[224, 224]} & \multicolumn{1}{c|}{[448, 448]} & \multicolumn{1}{c|}{[96, 96]}  \\ \hline
			\multirow{12}{*}{\begin{tabular}[c]{@{}c@{}}LAM/\\ Logit Match (LM)/\\ Prob. Match (KL)\end{tabular}} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}LP/FT epochs: 10/20\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}LP/FT epochs: 10/40\end{tabular}} & \multicolumn{1}{c|}{epochs: 20} & epochs: 10\\ \cline{2-6} 
			& \multicolumn{3}{c|}{\begin{tabular}[c]{@{}c@{}}LP/FT learning rate: 0.003/3e-5\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}learning rate: 3.49e-5\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}learning rate: 3.07e-3\end{tabular}}  \\ \cline{2-6} 
			& \multicolumn{2}{l|}{\begin{tabular}[c]{@{}l@{}}LP/FT training bs: 128/64\\ LP/FT SS pair bs: 256/64\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}LP/FT training bs: 48/48\\ LP/FT SS pair bs: 32/32\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}training bs: 10 \\ SS pair bs: 10\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}training bs: 128\\ SS pair bs: 128\end{tabular}} \\ \cline{2-6} 
			& \multicolumn{1}{c|}{$\lambda=10$} & \multicolumn{1}{c|}{$\lambda=0.5$} & \multicolumn{1}{c|}{$\lambda=0.2$} &  \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}$\lambda=5$ (LAM, KL)\\ $\lambda=0.05$ (LM) \end{tabular}} &  \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}$\lambda=10$ (LAM)\\ $\lambda=1$ (LM, KL) \end{tabular}} \\ \cline{2-6} 
			& \multicolumn{2}{l|}{\begin{tabular}[c]{@{}l@{}}SS pair transform:\\ RandCrop\\ RandHorizontalFlip\\ Normalize\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}SS pair transform:\\ RandCrop\\ RandHorizontalFlip\\ ColorJitter\\ RandGrayscale\\ Normalize\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}SS pair transform:\\ Normalize\end{tabular}}  & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}SS pair transform:\\ Normalize\end{tabular}}  \\ \cline{2-6} 
			& \multicolumn{2}{c|}{N/A}&\multicolumn{1}{c|}{$p=0.9$} & \multicolumn{2}{c|}{N/A} \\ \hline
			\begin{tabular}[c]{@{}c@{}}Feature\\Matching (FM) \end{tabular} & \multicolumn{3}{c|}{$\lambda=0.01$} & \multicolumn{1}{c|}{$\lambda=0.05$} & $\lambda=0.1$ \\ \hline
			Prob. Match (JS) & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}FT training bs: 32\\ FT SS pair bs: 48\end{tabular}} &\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}FT training bs: 48\\ FT SS pair bs: 48\end{tabular}} &\multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}FT training bs:10 \\ FT SS pair bs: 20 \end{tabular}} & \begin{tabular}[c]{@{}c@{}}FT training bs: 128 \\ FT SS pair bs: 128 \end{tabular}\\ \hline
			\begin{tabular}[c]{@{}c@{}}Other Methods\end{tabular} & \multicolumn{2}{l|}{\begin{tabular}[c]{@{}l@{}}LP/FT training bs: 128/64\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}LP/FT training bs: 48/48\end{tabular}} & \multicolumn{1}{l|}{training bs: 24} & \multicolumn{1}{l|}{training bs: 128} \\ \hline
		\end{tabular}
\end{table*}

%\textcolor{blue}{\section{Study on the effect of $\lambda$}} 



 \newpage

\section{Visualizations about the Effects of LAM}
\label{vis}



In Section
\ref{sec:CR-labeled}, we have argued that LAM  exerts two complementary regularization forces, one on the feature extractor and another on the classification head. In combination, they encourage a model to focus on the causal factors when making predictions.

To provide some empirical evidence for the claim, 
 we show in
Figure \ref{fig:weight distribution}  the weight distributions of the classification heads of three models trained on the ImageNet-9 dataset.  We see that the LAM model has significantly fewer high weights than the other two models.  
This indicates that the LAM model is indeed more ``focused'' than the other models.

\begin{figure*}[h!]
    \vspace{2mm}
    \centering
    \begin{tabular}{ccc}
        \includegraphics[width=5.2cm]{figs1/ERM+DA.png}
        & 
        \includegraphics[width=5.2cm]{figs1/JS.png}
        & 
        \includegraphics[width=5.2cm]{figs1/LAM.png}
        \\
         (a)  ERM+DA &  (b) Prob. Match (JS) &  (c)  LAM \\
    \end{tabular}
    \caption{Distributions of the weights of the classification heads of the models learned using ERM+DA, Probability Matching (JS), and LAM  on the ImageNet-9 dataset. }
    \label{fig:weight distribution}
\end{figure*}

What does the LAM model focus on?  
Visual examples in Figure~\ref{fig:feature_map} indicate that it focuses on the foreground objects. This claim is also supported by the additional examples in Figure~\ref{fig:more_saliency_map}.



\begin{figure*}[h!]
    \vspace{2mm}
    \centering
    \includegraphics[width=10cm]{figs1/gradcam_new_uai_more_compressed.png}
    \caption{GradCAM saliency maps for the top predicted class by models trained on ImageNet-9 using various methods. The model learned using LAM focuses better on the foreground objects.}
    \label{fig:more_saliency_map}
\end{figure*}



