\begin{table*}[h]
\vspace{-1em}
    \centering 
    %
    \setlength\tabcolsep{3pt}
    \caption{Comparison of our method and prior algorithms.}
    \vspace{-0.5em}
    \resizebox{0.75\textwidth}{!}{
    \begin{tabular}{lcccccc}
        \toprule
                            & ERM & IRM & Fish/IGA/AND-Mask & Fishr/CORAL & HGP/Hutchinson & \textbf{CMA} \\
        \midrule
        Gradient Matching   & No  & Yes & Yes               & No    & Yes            & \textbf{Yes}  \\
        Hessian Matching    & No  & No  & No                & Yes   & Yes            & \textbf{Yes}  \\
        Closed-Form Hessian & --  & --  & --                & No    & No             & \textbf{Yes}  \\
        \bottomrule
    \end{tabular}}
    \label{tab:methods_comparison}
    \vspace{-1em}
\end{table*}
\vspace{-0.5em}
\section{Introduction}\label{sec:introduction}
\vspace{-0.5em}
Classic machine learning methods rely on the assumption that training and test data are drawn from the same distribution, typically described as being independent and identically distributed (\textit{i.i.d.}). However, the \textit{i.i.d.} assumption is often violated in real-world scenarios due to variations in sampling populations~\citep{santurkar_breeds_2020}, temporal changes~\citep{shankar_image_2019}, and geographic differences~\citep{hansen_high-resolution_2013, christie_functional_2018}.\ Performance degradation due to distribution shifts is particularly critical in high-stakes applications. For instance, an autonomous driving system~\citep{dai_dark_2018, hu_causal-based_2021} trained on data collected in the United States may encounter different traffic conditions when deployed in other regions.\ Similarly, in medical imaging~\citep{wachinger_detect_2021,albadawy_deep_2018,tellez_quantifying_2019}, models trained on data from one demographic group may face challenges when applied to a different demographic.

Domain generalization (DG) aims to tackle this issue by leveraging data from multiple source domains to learn a model that performs well on unseen but related target domains. Although various approaches have been studied to address the DG problem, including Invariant Risk Minimization (IRM)~\citep{arjovsky_invariant_2020}, gradient matching~\citep{shi_gradient_2021,koyama_when_2021,parascandolo_learning_2020}, Hessian matching~\citep{rame_fishr_2022,hemati_understanding_2023}, and domain-invariant feature representation learning~\citep{ben-david_theory_2010, li_domain_2018,tzeng_adversarial_2017, hoffman_cycada_2017, muandet_domain_2013, long_learning_2015, zhao_learning_2019}, these methods often appear disconnected and are based on different underlying principles. We discuss this related work in~\Cref{app:related_works}.\looseness=-1

We unify these seemingly disparate methods through the theory of moment alignment. Our theory builds upon the \textit{transfer measure}, a principled DG framework proposed by~\citet{zhang_quantifying_2021}. We first extend the definition of the transfer measure to multi-source DG, inducing a target error bound. We then prove that aligning the derivatives improves the transfer measure under two different assumptions: when there exists a classifier that is simultaneously optimal across all domains (referred to as the \textit{IRM assumption}), and when there is not. We show that IRM, gradient matching, and Hessian matching approaches are special cases of moment alignment. Our theory explains the success of state-of-the-art methods like HGP and Hutchinson's algorithm~\citep{hemati_understanding_2023}, which perform both gradient and Hessian matching. This combined approach provides an advantage over methods that match only gradients or only Hessians. Furthermore, we establish the duality between feature moments and the derivatives of the classifiers, thereby unifying these approaches.

Drawing from the theoretical results, we propose \textbf{C}losed-Form \textbf{M}oment \textbf{A}lignment (CMA), a novel DG algorithm that aligns the first- and second-order derivatives across domains. The loss objective in CMA is similar to those of HGP and Hutchinson's method, but CMA gains computational efficiency by computing gradients and Hessians analytically. Our method bypasses the computational limitations of existing gradient and Hessian matching techniques that rely on repeated backpropagation or sampling-based estimation. Additionally, we provide two Hessian computation methods: direct Frobenius norm computation, which is faster at a higher memory cost, and a memory-efficient method that reduces the memory requirement at the expense of increased computation time. This flexibility allows users to balance memory usage and computational time.
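
To make the idea of closed-form gradients and Hessians concrete, consider the following illustrative sketch. It is not the formal statement of CMA, and the notation ($W$, $x$, $y$, $p$, $g_e$, $H_e$, $\alpha$, $\beta$, $n$) is assumed here purely for exposition. For a linear classifier with logits $z = Wx$, softmax output $p = \mathrm{softmax}(z)$, one-hot label $y$, and cross-entropy loss $\ell$, the per-example derivatives are available without backpropagation:
\[
    \nabla_{W}\,\ell = (p - y)\,x^{\top},
    \qquad
    \nabla^{2}_{\mathrm{vec}(W)}\,\ell = x x^{\top} \otimes \bigl(\mathrm{diag}(p) - p p^{\top}\bigr),
\]
where $\mathrm{vec}(\cdot)$ stacks columns. Writing $g_e$ and $H_e$ for the averages of these quantities over source domain $e$, and $\bar{g}$, $\bar{H}$ for their means across the $n$ source domains, a generic gradient-and-Hessian matching penalty could take the form
\[
    \alpha \sum_{e=1}^{n} \| g_e - \bar{g} \|_2^{2} \;+\; \beta \sum_{e=1}^{n} \| H_e - \bar{H} \|_F^{2},
\]
with hypothetical weights $\alpha, \beta \ge 0$ added to the empirical risk.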

The empirical evaluation consists of two settings designed to validate our theoretical framework and proposed algorithm. First, we conduct linear probing experiments on the Waterbirds, CelebA, and MultiNLI datasets, where the IRM assumption holds. Second, we perform full fine-tuning experiments on selected datasets from the DomainBed benchmark~\citep{gulrajani_search_2020}, where the IRM assumption is not guaranteed to hold. We compare CMA with ERM, CORAL~\citep{sun_deep_2016}, and Fishr~\citep{rame_fishr_2022}. CMA's performance aligns with our theory and matches state-of-the-art performance.

%
Below we summarize our main contributions:
\begin{itemize}[noitemsep,topsep=0pt]
    \item \textit{Unified Theory of Moment Alignment:} We develop a theory of moment alignment that unifies IRM, gradient matching, and Hessian matching.\ This unified framework enhances our understanding of the interplay between these methods and their combined effect on improving generalization across domains.\ We further establish the duality between feature moments and the classifier derivatives.
    
    
    \item \textit{New Algorithm:}\ We propose \textbf{C}losed-Form \textbf{M}oment \textbf{A}lignment (CMA), a novel DG algorithm that performs both gradient and Hessian matching. CMA enjoys computational efficiency by analytically computing gradients and Hessians, avoiding the need for repeated backpropagation or sampling-based estimation.\ We offer two Hessian computation methods to optimize memory usage and computational speed.
    \item \textit{Empirical Validation:} We validate CMA through both quantitative and qualitative analyses. CMA matches state-of-the-art performance while achieving superior worst-group accuracy and feature moment alignment, reducing first- and second-moment discrepancies more effectively than Fishr and ERM.

\end{itemize}
%

Our work offers a unified perspective that enhances theoretical understanding and practical performance in addressing distribution shifts. As summarized in \Cref{tab:methods_comparison}, our method is, to the best of our knowledge, the first to achieve exact gradient and Hessian matching.

