By aligning both gradients and Hessians in closed form, CMA implicitly integrates multiple existing algorithms. Below we make these connections explicit.
\subsection{CMA as Invariant Risk Minimization}
We draw connections between the IRM and CMA objectives. Fixing a feature extractor and letting the classifier head be parameterized by $\boldsymbol{\theta}$, the IRMv1 objective in \citet{arjovsky_invariant_2020} is:
\begin{equation}
    \begin{aligned}
        \mathcal{L}_{\text{IRM}} := \mathcal{L}_{\text{ERM}} +  \lambda \frac{1}{K}\sum_{i = 1}^K \| \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right) \|_2^2
    \end{aligned}
    \tag{IRMv1}
\end{equation}
On the other hand, we can rewrite the gradient variance regularization in \Cref{eq:cma_loss} as
\begin{equation}\label{eq:gv_reg}
    \begin{aligned}
        \frac{1}{K}\sum_{i=1}^K  \left\| \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right) - \overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}\left(\boldsymbol{\theta}\right)}\right\|_2^2 & =  \frac{1}{K}\sum_{i=1}^K  \left\| \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right)\right\|^2_2 - \left\|\frac{1}{K}\sum_{j=1}^K \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_j}\left(\boldsymbol{\theta}\right)\right\|_2^2.
    \end{aligned}
\end{equation}
The second term on the right-hand side, the squared norm of the average gradient, is small for a classifier $\boldsymbol{\theta^*}$ well-trained on $\mathcal{L}_{\text{ERM}}$, and the first term has the same form as the regularizer in \Cref{eq:irmv1_loss}. Therefore, penalizing large gradient variance can be seen as encouraging the learned classifier $\boldsymbol{\theta}$ to be invariant across domains. Under the same assumptions as in \Cref{thm:moment_alignment_irm}, at the optimal invariant predictor $\boldsymbol{\theta^*}$, the average gradient is zero, so the gradient variance term in \Cref{eq:cma_loss} is exactly the gradient penalty in \Cref{eq:irmv1_loss}. Hence, setting $\beta = 0$ in \Cref{eq:cma_loss} recovers \Cref{eq:irmv1_loss}.
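
The identity in \Cref{eq:gv_reg} follows from expanding the squared norm. As a brief sketch, write $\mathbf{g}_i := \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right)$ and $\overline{\mathbf{g}} := \frac{1}{K}\sum_{j=1}^K \mathbf{g}_j$; this shorthand is introduced here only for readability and is not used elsewhere. Then
\begin{equation*}
    \begin{aligned}
        \frac{1}{K}\sum_{i=1}^K \left\| \mathbf{g}_i - \overline{\mathbf{g}} \right\|_2^2
        &= \frac{1}{K}\sum_{i=1}^K \left\| \mathbf{g}_i \right\|_2^2 - 2 \left\langle \frac{1}{K}\sum_{i=1}^K \mathbf{g}_i,\; \overline{\mathbf{g}} \right\rangle + \left\| \overline{\mathbf{g}} \right\|_2^2 \\
        &= \frac{1}{K}\sum_{i=1}^K \left\| \mathbf{g}_i \right\|_2^2 - \left\| \overline{\mathbf{g}} \right\|_2^2,
    \end{aligned}
\end{equation*}
where the last step uses $\frac{1}{K}\sum_{i=1}^K \mathbf{g}_i = \overline{\mathbf{g}}$, so the cross term equals $-2\|\overline{\mathbf{g}}\|_2^2$.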

\subsection{CMA as Gradient Matching}
While multiple versions of gradient matching losses have been proposed~\citep{shi_gradient_2021, koyama_out--distribution_2020, parascandolo_learning_2020}, we focus on the most recent one, proposed by \citet{shi_gradient_2021} and defined as:
\begin{equation}\label{eq:gm_loss_2}
    \begin{aligned}
        \mathcal{L}_{\text{GM}} := \mathcal{L}_{\text{ERM}} +\lambda \frac{1}{K} \left( \sum_{i = 1}^K \left\|\nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right)\right\|^2_2 - \left\|\sum_{j = 1}^K \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_{j}}\left(\boldsymbol{\theta}\right)\right\|^2_2 \right)
    \end{aligned}
    \tag{GM}
\end{equation}
Comparing the second term with \Cref{eq:gv_reg}, and ignoring the constant factor $\lambda$, the two penalties differ by $\frac{K-1}{K^2} \|\sum_{j = 1}^K \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_{j}}\left(\boldsymbol{\theta}\right)\|^2_2$. When an invariant optimal predictor $\boldsymbol{\theta^*}$ exists, the sum of the domain gradients is zero at $\boldsymbol{\theta^*}$, so this difference vanishes, and setting $\beta = 0$ in \Cref{eq:cma_loss} recovers \Cref{eq:gm_loss_2}.
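
To see where the factor $\frac{K-1}{K^2}$ comes from, one can subtract the penalty in \Cref{eq:gm_loss_2} (without $\lambda$) from the right-hand side of \Cref{eq:gv_reg}. Using the shorthand $\mathbf{g}_i := \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}\left(\boldsymbol{\theta}\right)$, introduced here only for readability, the terms $\frac{1}{K}\sum_{i}\|\mathbf{g}_i\|_2^2$ cancel and
\begin{equation*}
    \begin{aligned}
        \left( \frac{1}{K}\sum_{i=1}^K \left\|\mathbf{g}_i\right\|_2^2 - \frac{1}{K^2}\left\|\sum_{j=1}^K \mathbf{g}_j\right\|_2^2 \right)
        - \frac{1}{K}\left( \sum_{i=1}^K \left\|\mathbf{g}_i\right\|_2^2 - \left\|\sum_{j=1}^K \mathbf{g}_j\right\|_2^2 \right)
        &= \left( \frac{1}{K} - \frac{1}{K^2} \right) \left\|\sum_{j=1}^K \mathbf{g}_j\right\|_2^2 \\
        &= \frac{K-1}{K^2} \left\|\sum_{j=1}^K \mathbf{g}_j\right\|_2^2 .
    \end{aligned}
\end{equation*}
For instance, with $K = 2$ domains the two penalties differ by $\frac{1}{4}\|\mathbf{g}_1 + \mathbf{g}_2\|_2^2$.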

\subsection{CMA as Hessian Matching}\label{sec:unif_hm}
We first compare CMA with Fishr~\citep{rame_fishr_2022}, a state-of-the-art DG algorithm based on Hessian matching.
The principle behind Hessian matching is to match the domain-level Hessian matrices by minimizing the objective:
\begin{equation}\label{eq:hm_loss_2} \tag{HM}
    \mathcal{L}_{\text{HM}} := \mathcal{L}_{\text{ERM}} +  \lambda \frac{1}{K}\sum_{i = 1}^K \| \mathbf{H}_{\mu_i} - \overline{\mathbf{H}}\|_F^2
\end{equation}

\citet{rame_fishr_2022} achieve this by approximating the Hessian matrices with their diagonals. In contrast, we propose to compute the Hessian matrices analytically.
Thus, setting $\alpha = 0$ in \Cref{eq:cma_loss} yields the closed-form counterpart of the Fishr objective.
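
One way to read the gap between a full and a diagonal Hessian penalty is through the entrywise decomposition of the Frobenius norm. The sketch below assumes, as an idealization, that the diagonal approximation simply discards the off-diagonal Hessian entries; the estimator used in practice may differ in detail. Writing $\boldsymbol{\Delta}_i := \mathbf{H}_{\mu_i} - \overline{\mathbf{H}}$ (notation introduced here for illustration only),
\begin{equation*}
    \left\| \boldsymbol{\Delta}_i \right\|_F^2
    = \sum_{k} \left(\boldsymbol{\Delta}_i\right)_{kk}^2 + \sum_{k \neq l} \left(\boldsymbol{\Delta}_i\right)_{kl}^2
    \;\geq\; \sum_{k} \left(\boldsymbol{\Delta}_i\right)_{kk}^2 ,
\end{equation*}
so a penalty on the diagonals alone lower-bounds the penalty in \Cref{eq:hm_loss_2}, and the two coincide only when the off-diagonal entries of every $\boldsymbol{\Delta}_i$ vanish.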

Next, we compare CMA with the two objectives proposed in \citet{hemati_understanding_2023}, namely HGP and Hutchinson's method (eq. (18) and eq. (23) in \citet{hemati_understanding_2023}): 
\begin{equation}\label{eq:hgp_loss} \tag{HGP}
    \mathcal{L}_{\text{HGP}} = \mathcal{L}_{\text{ERM}} + \frac{1}{K}\sum_{i=1}^K \left( \alpha \| \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i} - \overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}}\|_2^2 + \beta \| \mathbf{H}_{\mu_i} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i} - \overline{\mathbf{H}\nabla_{\boldsymbol{\theta}} \mathcal{L}}\|_2^2 \right)
\end{equation}
where $\overline{\mathbf{H}\nabla_{\boldsymbol{\theta}} \mathcal{L}} = \frac{1}{K} \sum_{i=1}^K \mathbf{H}_{\mu_i} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i}$ is the average Hessian-gradient product, and
\begin{equation}\label{eq:hutchinson_loss} \tag{Hutchinson}
    \mathcal{L}_{\text{Hutchinson}} = \mathcal{L}_{\text{ERM}} + \frac{1}{K}\sum_{i=1}^K \left( \alpha \| \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mu_i} - \overline{\nabla_{\boldsymbol{\theta}} \mathcal{L}}\|_2^2 + \beta \| \mathbf{D}_{\mu_i} - \overline{\mathbf{D}}\|_2^2 \right)
\end{equation}
where $\mathbf{D}_{\mu_i}$ is the Hessian diagonal estimated by Hutchinson's method~\citep{bekas_estimator_2007} and $\overline{\mathbf{D}}$ is its average across domains. Like CMA, both HGP and Hutchinson's method match the first and second moments across domains. Unlike CMA, HGP approximates the second-order penalty with Hessian-gradient products, while Hutchinson's method approximates it with Hessian diagonals, which are themselves estimated by sampling. In other words, \Cref{eq:cma_loss} can be viewed as the closed-form counterpart of \Cref{eq:hgp_loss} and \Cref{eq:hutchinson_loss}.
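
For context, a common form of the Hutchinson-type diagonal estimator is sketched below. It assumes Rademacher probe vectors $\mathbf{z}_1, \dots, \mathbf{z}_m$ with i.i.d.\ entries in $\{-1, +1\}$; the number of probes $m$ and the choice of probe distribution are illustrative assumptions, not details taken from the cited works:
\begin{equation*}
    \widehat{\mathbf{D}}_{\mu_i} = \frac{1}{m}\sum_{s=1}^m \mathbf{z}_s \odot \left( \mathbf{H}_{\mu_i} \mathbf{z}_s \right),
    \qquad
    \mathrm{E}\left[ \mathbf{z} \odot \left( \mathbf{H}_{\mu_i} \mathbf{z} \right) \right]_k
    = \sum_{l} \left(\mathbf{H}_{\mu_i}\right)_{kl} \, \mathrm{E}\left[ z_k z_l \right]
    = \left(\mathbf{H}_{\mu_i}\right)_{kk} ,
\end{equation*}
since $\mathrm{E}[z_k z_l] = 1$ if $k = l$ and $0$ otherwise. Under these assumptions the estimate is unbiased, but for finite $m$ it carries sampling variance driven by the off-diagonal entries of $\mathbf{H}_{\mu_i}$, which a closed-form computation avoids.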
