\vspace{-0.5em}
\section{Moment Alignment: A Unifying Framework}\label{sec:unif}
\vspace{-0.5em}
While various approaches to DG exist, they appear largely disconnected, and, to the best of our knowledge, no prior work has explicitly drawn connections between them. In this section, we unify IRM, gradient matching, and Hessian matching under the CMA framework. We further establish a duality between feature learning and classifier fitting.
\vspace{-0.25em}
\subsection{IRM as Moment Alignment}\label{thm:unif_irm}
\vspace{-0.25em}
When the features are fixed and satisfy the IRM assumption, minimizing the IRMv1 objective~\citep{arjovsky_invariant_2020}
\begin{equation}\label{eq:irmv1_loss}
    \begin{aligned}
        \mathcal{L}_{\text{IRM}} := \mathcal{L}_{\text{ERM}} +  \lambda \frac{1}{K}\sum_{i = 1}^K \| \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right) \|_2^2,
    \end{aligned}
    \tag{IRMv1}
\end{equation}
 recovers such an invariant optimal predictor, and~\Cref{thm:moment_alignment_irm} provides an upper bound on the target error. On the other hand, when the fixed features do not satisfy the IRM assumption,
 the IRMv1 penalty seeks a parameter $\boldsymbol{\theta}$ whose average squared gradient norm across domains is small, thereby minimizing $g$ in the upper bound in \Cref{thm:moment_alignment}.

 \vspace{-0.5em}
\subsection{Gradient and Hessian Matching as Moment Alignment}
\vspace{-0.5em}
The general gradient and Hessian matching objectives take the following forms or variants thereof:
\begin{equation}\label{eq:gm_loss}
\small
    \begin{aligned}
        \mathcal{L}_{\text{GM}} := \mathcal{L}_{\text{ERM}} +\lambda \frac{1}{K} \sum_{i = 1}^K \left\|\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i} \left(\boldsymbol{\theta}\right)- \overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}\left(\boldsymbol{\theta}\right)}\right\|^2_2
    \end{aligned}
    \tag{GM}
\end{equation}
\begin{equation}\label{eq:hm_loss} \tag{HM}
\small
    \mathcal{L}_{\text{HM}} := \mathcal{L}_{\text{ERM}} +  \lambda \frac{1}{K}\sum_{i = 1}^K \| \mathbf{H}_{\mu_i}\left(\boldsymbol{\theta}\right) - \overline{\mathbf{H}\left(\boldsymbol{\theta}\right)}\|_F^2
\end{equation}
By definition, gradient matching and Hessian matching are special cases of moment alignment: they reduce the first-order and second-order terms, respectively, in the upper bound of the transfer measure. Notably, when the IRM assumption holds, the penalty in \Cref{eq:gm_loss} will favor an invariant optimal predictor.
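
As a brief worked illustration of how the penalties in \Cref{eq:irmv1_loss} and \Cref{eq:gm_loss} relate, assume that the overline denotes the average over the $K$ training domains, i.e., $\overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}\left(\boldsymbol{\theta}\right)} = \frac{1}{K}\sum_{j=1}^K \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_j}\left(\boldsymbol{\theta}\right)$ (this reading of the notation is an assumption made here). Expanding the square then gives
\begin{equation*}
    \frac{1}{K}\sum_{i = 1}^K \left\|\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right) - \overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}\left(\boldsymbol{\theta}\right)}\right\|_2^2
    = \frac{1}{K}\sum_{i = 1}^K \left\|\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right)\right\|_2^2 - \left\|\overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}\left(\boldsymbol{\theta}\right)}\right\|_2^2,
\end{equation*}
so, under this assumption, the gradient matching penalty equals the IRMv1 penalty minus the squared norm of the average gradient. In particular, the two penalties coincide at any $\boldsymbol{\theta}$ where the average gradient vanishes.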

By the results in \Cref{sec:theory}, aligning both gradients and Hessians improves DG over aligning only one of them. This helps explain the success of HGP and Hutchinson~\citep{hemati_understanding_2023} over methods that focus on gradient matching~\citep{shi_gradient_2021, parascandolo_learning_2020, koyama_out--distribution_2020} or Hessian matching~\citep{rame_fishr_2022, sun_deep_2016}.
\vspace{-0.5em}
\subsection{Feature Matching as Moment Alignment}\label{sec:unif_feature}
\vspace{-0.5em}
So far, we have discussed moment alignment under fixed features. Next, we establish a connection between the derivatives of the classifier and the moments of the features, where the classifier is assumed to be the last layer of a neural network, i.e., a linear predictor over the learned features.

For a softmax classifier, the prediction is a function of $\mathbf{x}^{\top} \boldsymbol{\theta}$, where $\mathbf{x}$ is a feature vector and $\boldsymbol{\theta}$ is the classifier. Therefore, $\nabla^n_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})$ involves the $n^{\text{th}}$ moment of $\mathbf{x}$, and by matching the $n^{\text{th}}$-order derivatives w.r.t. the classifier head, we are matching the $n^{\text{th}}$ moment of $\mathbf{x}$ across domains. Another view of this duality is that, by the symmetry between $\mathbf{x}$ and $\boldsymbol{\theta}$, we can analogously derive the results in \Cref{sec:theory} with $\mathbf{x}$ as the optimization target.
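
To make this correspondence concrete, consider the binary special case with the logistic loss, used here only as an illustrative assumption: for a label $y \in \{0, 1\}$ and $\sigma(z) = 1/(1 + e^{-z})$, let $\ell(\boldsymbol{\theta}) = -y \log \sigma(\mathbf{x}^{\top}\boldsymbol{\theta}) - (1 - y)\log\left(1 - \sigma(\mathbf{x}^{\top}\boldsymbol{\theta})\right)$. Then
\begin{equation*}
    \begin{aligned}
        \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) &= \left(\sigma(\mathbf{x}^{\top}\boldsymbol{\theta}) - y\right)\mathbf{x}, \\
        \nabla^2_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) &= \sigma(\mathbf{x}^{\top}\boldsymbol{\theta})\left(1 - \sigma(\mathbf{x}^{\top}\boldsymbol{\theta})\right)\mathbf{x}\mathbf{x}^{\top}.
    \end{aligned}
\end{equation*}
The gradient is linear in $\mathbf{x}$ and the Hessian is quadratic in $\mathbf{x}$, each weighted by a scalar that depends on $\mathbf{x}$ only through $\mathbf{x}^{\top}\boldsymbol{\theta}$. Taking expectations over a domain therefore yields weighted first and second moments of the features.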


IRM~\citep{ahuja_invariant_2020} and CORAL~\citep{sun_deep_2016} are two concrete examples of this feature-parameter duality. Going from the feature space to the parameter space, CORAL matches the feature covariance, namely the second moment of $\mathbf{x}$. Thus, CORAL approximately performs Hessian matching in the parameter space. We refer interested readers to Proposition 4 in \citet{hemati_understanding_2023} for a discussion of the attributes aligned by CORAL. Conversely, starting from the parameter space and moving to the feature space, the penalty term in \Cref{eq:irmv1_loss} regularizes the gradient w.r.t. the classifier, corresponding to first-moment alignment in the feature space, i.e., aligning the features themselves.
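
The link between Hessian matching and second-moment matching becomes exact in a simpler setting, which we sketch purely as an illustration. Assume, instead of the softmax setting above, the squared loss $\ell(\boldsymbol{\theta}) = \tfrac{1}{2}\left(\mathbf{x}^{\top}\boldsymbol{\theta} - y\right)^2$ with a scalar target $y$, and assume that $\mathbf{H}_{\mu_i}$ denotes the Hessian of the expected loss on domain $\mu_i$. Then $\nabla^2_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = \mathbf{x}\mathbf{x}^{\top}$ depends on neither $\boldsymbol{\theta}$ nor $y$, so that
\begin{equation*}
    \mathbf{H}_{\mu_i}\left(\boldsymbol{\theta}\right) = \mathbb{E}_{(\mathbf{x}, y) \sim \mu_i}\left[\mathbf{x}\mathbf{x}^{\top}\right].
\end{equation*}
Under these assumptions, and reading the overline as the average over domains, the penalty in \Cref{eq:hm_loss} is the average squared Frobenius distance between each domain's uncentered second moment of the features and the cross-domain mean of these moments. CORAL instead compares centered covariances, which is one reason the correspondence is only approximate in general.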




