\section{Introduction}

Most problems in learning theory assume identically distributed data.
In contrast, in domain generalization~\cite{blanchard2011generalizing,muandet2013domain,blanchard2021domain}, the learner observes $N$ data samples $(x_i^{(1)},y_i^{(1)})_{i=1}^{n},\ldots,(x_i^{(N)},y_i^{(N)})_{i=1}^{n}\in (\mathcal{X}\times \mathcal{Y})^{n\times N}$
which follow different distributions $P^{(1)},\ldots,P^{(N)}$.
These \textit{source} distributions model different real-world \textit{domains}, e.g., different medical patients.
The goal of domain generalization is to find a model $g:(x_i^{T})_{i=1}^{n}\mapsto (\widehat{f}:\mathcal{X}\to\mathcal{Y})$ that is able to derive a predictor $\widehat{f}:\mathcal{X}\to\mathcal{Y}$ from only \textit{unlabeled} data $(x_i^{T})_{i=1}^{n}\in\mathcal{X}^n$ following a new \textit{target} distribution $P^T$.
The performance of $g$ is quantified in expectation over a random draw of the target distribution $P^T$, i.e., by $\mathcal{E}^\infty(g):=\int_{\mathcal{M}_1^+} \int_{\mathcal{X}\times\mathcal{Y}} (\widehat{f}(x)-y)^2\diff P^T(x,y)\diff E(P^T)$ for some meta-distribution (or \textit{environment}~\cite{baxter1998theoretical}) $E$ from which $P^T$ is drawn.

This work is concerned with the worst-case sample complexity of domain generalization.
More precisely, we study the question of \textit{whether and how the model $g$ above can be computed from the $N$ given samples, such that it satisfies finite sample bounds on $\mathcal{E}^\infty(g)-\inf_{h}\mathcal{E}^\infty(h)$ under some assumptions on the data generating process?}

Although many finite sample results are available in the similar settings of meta-learning (cf.~\cite{maurer2005algorithmic}) and multi-task learning (cf.~\cite{evgeniou2005learning}), relatively little is known for domain generalization.
Indeed, the inaccessibility of target labels requires new techniques for domain generalization.
For example, to learn from unlabeled data of a target distribution $P^T$, it is required to relate the marginal distribution $P_\mathcal{X}^T(x)$ (of inputs $x\in\mathcal{X}$) to the conditional distribution $P^T(y|x)$ (of outputs $y\in\mathcal{Y}$ w.r.t.~inputs), where $P^T(x,y)=P_\mathcal{X}^T(x) P^T(y|x)$.
This is implicitly done in the seminal \textit{marginal transfer} approach of~\cite{blanchard2011generalizing,blanchard2021domain}, where a \textit{universally consistent} algorithm is proposed, i.e., an algorithm computing $g$ such that $\mathcal{E}^\infty(g)\to \inf_{h}\mathcal{E}^\infty(h)$ almost surely for $n,N\to\infty$.
However, to the best of our knowledge, specifying the rate of this convergence is still an open research problem.

The main conceptual contribution of this work is to recast domain generalization as a problem of functional regression, which allows for analytical results from that field.
More precisely, we propose a new algorithm for \textit{explicitly} learning (from the input samples) an operator, which maps (kernel mean embeddings of) input marginal distributions $P_\mathcal{X}(x)$ to approximations of the regression functions (Bayes predictors) $f_{P}:=\argmin_{f:\mathcal{X}\to\mathcal{Y}}\int_{\mathcal{X}\times\mathcal{Y}} (f(x)-y)^2\diff P(x,y)$ of the corresponding conditional distributions $P(y|x)$.
As a first step, we focus on learning a linear operator with slope functions residing in an RKHS, which allows us to apply analytical arguments from~\cite{mollenhauer2022learning,jin2022minimax,tong2022non}, resulting in explicit finite sample bounds.
However, our new concept opens new directions for domain generalization by linear and non-linear operator learning.

Another, particularly practical, advantage of our method is the possibility to choose different reproducing kernel Hilbert spaces (RKHSs) for regression on different source distributions $(P^{(i)})_{i=1}^N$.
This allows both expert choices and automation, e.g., selecting among well-known kernels by cross-validation.
We provide a numerical example which illustrates this advantage and gives a simple implementation of the proposed algorithm.
Our contributions can be summarized as follows:\begin{itemize}
\itemsep0em 
    \item We provide a new concept for approaching domain generalization by functional regression.
    \item We propose a new algorithm which allows a domain-specific, data-based construction of predictors, e.g., different learned RKHSs for different domains. As one consequence, new target predictors need not be well approximable by pre-defined (e.g.~Gaussian) RKHSs.
    \item We provide (to the best of our knowledge, the first) finite sample bounds for $\mathcal{E}^\infty(g)-\inf_{h}\mathcal{E}^\infty(h)$ in the domain generalization setting of~\cite{blanchard2011generalizing}.
    \item We provide a numerical implementation showing the advantage of our algorithm.
   
\end{itemize}

\section{Background on Domain Generalization}







\subsection{Domain Generalization}

Let $\mathcal{X}\subset\mathbb{R}^{d}$ be a compact \textit{input} space (with Lebesgue measure one for simplicity) and $\mathcal{Y}\subset\mathbb{R}$ be a compact \textit{output} space.
The problem of domain generalization~\cite{blanchard2011generalizing,muandet2013domain,blanchard2021domain} extends the problem of supervised learning by relaxing the assumption of one unique underlying data distribution.
In particular, in domain generalization, we are given a vector
\begin{align}
    \label{eq:source_samples}
    ({\bf z}^{(i)})_{i=1}^N:=
    ({\bf x}^{(i)},{\bf y}^{(i)})_{i=1}^N:=\left((x_j^{(i)},y_j^{(i)})_{j=1}^{n_i}\right)_{i=1}^N\in \left(\bigcup_{n=1}^\infty\left(\mathcal{X}\times\mathcal{Y}\right)^n\right)^N
\end{align}
of \textit{source} samples, drawn independently at random according to $N\in\mathbb{N}$ respective probability measures $P^{(1)},\ldots,P^{(N)}$ from the set $\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$ of probability measures on $\mathcal{X}\times\mathcal{Y}$.
For convenience, we represent a sample ${\bf z}^{(i)}$ by its associated empirical probability measure $\widehat{P}^{(i)}:=\frac{1}{n_i}\sum_{j=1}^{n_i} \delta_{(x_j^{(i)},y_j^{(i)})}\in\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$, where $\delta_z$ is the Dirac measure at $z\in\mathcal{X}\times\mathcal{Y}$.
The goal in domain generalization is to construct an algorithm
\begin{align}
\label{eq:domain_generalization_algorithm}
    A:\left(\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})\right)^N &\to
    \left\{ g: \mathcal{M}_1^+(\mathcal{X}) \to \{f:\mathcal{X}\to\mathcal{Y}\}\right\}
\end{align}
which maps the $N$ source samples $\widehat{P}^{(1)},\ldots,\widehat{P}^{(N)}$ to a function
\begin{align}
\label{eq:domain_generalization_function_g}
g:\mathcal{M}_1^+(\mathcal{X}) \to \{f:\mathcal{X}\to\mathcal{Y}\}
\end{align}
that needs only an \textit{unlabeled} target sample ${\bf x}^T=(x_j^T)_{j=1}^{n_T}$, represented by $\widehat{P}_\mathcal{X}^{T}\in\mathcal{M}_1^+(\mathcal{X})$ and
drawn independently at random according to some (marginal) probability measure $P_\mathcal{X}^{T}\in\mathcal{M}_1^+(\mathcal{X})$, to infer a predictor $g(\widehat{P}_\mathcal{X}^T):=f:\mathcal{X}\to\mathcal{Y}$ that performs well on new data  $(x,y)$ drawn (independently from ${\bf x}^T$) according to $P^T$~\citep{blanchard2011generalizing,blanchard2021domain}.

In the \textit{two-stage generative model of domain generalization}~\citep[Assumption~2]{blanchard2021domain}, the probability measures $P^{(1)},\ldots,P^{(N)},P^T$ are drawn independently at random according to a \textit{meta} probability measure $E$ on $\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$%
\footnote{If we equip $\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$ with $\tau_w(\mathcal{X}\times\mathcal{Y})$, the weakest topology on $\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$ such that the mapping $L_h:(\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y}),\tau_w(\mathcal{X}\times\mathcal{Y}))\to\mathbb{R}$ with $L_h(P)=\int_{\mathcal{X}\times\mathcal{Y}} h(x,y)\diff P(x,y)$ is continuous for all bounded and continuous functions $h:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, and denote by $\mathcal{B}(\tau_w(\mathcal{X}\times\mathcal{Y}))$ the associated Borel $\sigma$-algebra, then $(\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y}),\mathcal{B}(\tau_w(\mathcal{X}\times\mathcal{Y})))$ itself becomes a measurable space, cf.~\cite{maurer2005algorithmic,szabo2016learning}.},
and the quality of the prediction of the model $g:=A(\widehat{P}^{(1)},\ldots,\widehat{P}^{(N)})$ is measured by the \textit{idealized risk}
\begin{align}
    \label{eq:idealized_risk}
    \mathcal{E}^{\infty}(g)=\int_{\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})} \int_{\mathcal{X}\times\mathcal{Y}} \left(g(P_\mathcal{X})(x)-y\right)^2\diff P(x,y) \diff E(P).
\end{align}
The choice of the error $\mathcal{E}^{\infty}(g)$ models the goal of domain generalization to find (in expectation over the choice of $P^T$) a model $f=g(P_\mathcal{X}^T)$ with a low expected target risk $\int_{\mathcal{X}\times\mathcal{Y}} (f(x)-y)^2\diff P^T$.

\subsection{Marginal Transfer Learning}
\label{subsec:marginal_transfer_learning}

In the seminal works~\cite{blanchard2011generalizing,muandet2013domain,blanchard2021domain}, the predictor $g(\widehat{P}_\mathcal{X}^T):\mathcal{X}\to\mathcal{Y}$
is defined by $g(\widehat{P}_\mathcal{X}^T)(x):=f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}(\widehat{P}_\mathcal{X}^T,x)$ for a function
\begin{align}
    \label{eq:domain_generalization_function_g_augmented}
    f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}:\mathcal{M}_1^+(\mathcal{X})\times \mathcal{X}\to\mathcal{Y}
\end{align}
\sloppy that is computed from the data samples ``augmented'' by the input marginals $\widehat{P}_\mathcal{X}^{(i)}$, i.e., from $\left((\widehat{P}_\mathcal{X}^{(1)}, x_j^{(1)}),y_j^{(1)}\right)_{j=1}^{n_1},\ldots, \left((\widehat{P}_\mathcal{X}^{(N)}, x_j^{(N)}),y_j^{(N)}\right)_{j=1}^{n_N}$.
This approach is referred to as \textit{marginal transfer learning}.
More precisely,~\cite{blanchard2021domain} follow~\cite{evgeniou2005learning} and use an RKHS $\mathcal{H}_{\overline{k}}$ generated by a kernel $\overline{k}$ on $\mathcal{M}_1^+(\mathcal{X})\times\mathcal{X}$ defined by
$\overline{k}((P^{(1)},x_1), (P^{(2)},x_2)):=k_{\mathcal{M}_1^+(\mathcal{X})}(P^{(1)},P^{(2)})\cdot k_\mathcal{X}(x_1,x_2)$,
where $k_{\mathcal{M}_1^+(\mathcal{X})}$ is a kernel on $\mathcal{M}_1^+(\mathcal{X})$ and $k_\mathcal{X}$ is a kernel on $\mathcal{X}$.
The model $f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}=f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^\lambda$ in Eq.~\eqref{eq:domain_generalization_function_g_augmented} is computed by penalized risk estimation
\begin{align}
    \label{eq:dom_gen_kernel_least_squares}
    f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^\lambda:= \argmin_{f\in\mathcal{H}_{\overline{k}}}
    \frac{1}{N}\sum_{i=1}^N \frac{1}{n_i}\sum_{j=1}^{n_i}
    \left(f(\widehat{P}_\mathcal{X}^{(i)},x_j^{(i)})-y_j^{(i)}\right)^2+\lambda \norm{f}_{\mathcal{H}_{\overline{k}}}^2.
\end{align}
Using $\mathcal{H}_{\overline{k}}$,~\cite{blanchard2021domain} prove convergence in idealized risk of the estimator in Eq.~\eqref{eq:dom_gen_kernel_least_squares}.
More precisely, they prove in Theorem~15 and Corollary~16, for $g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^\lambda(P)(x):=f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^\lambda(P,x)$, the convergence
\begin{align}
\label{eq:domain_generalization_consistency}
\mathcal{E}^\infty(g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda})\to \inf_{g:\mathcal{M}_1^+(\mathcal{X})\to\left\{f:\mathcal{X}\to\mathbb{R}\right\}}\mathcal{E}^{\infty}(g)
\end{align}
in probability for $N\to\infty$, when the sample sizes $n_1,\ldots,n_N$ are randomly drawn, under rather general conditions on $\overline{k},\mathcal{X}$ and under a suitable choice $\lambda=\lambda(N)$.
This consistency is interesting because it allows us to hope for a small target error (in expectation w.r.t.~the random draw of $P^T$) for a \textit{sufficiently large number} $N$ of source samples ${\bf z}^{(1)},\ldots,{\bf z}^{(N)}$ of sufficiently large sample sizes $n_1,\ldots,n_N$, respectively.

\section{Problem}
\label{sec:problem}

Two issues arise. The first concerns the final predictor $g(\widehat{P}_\mathcal{X}^T)(\cdot)=f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda(N)}(\widehat{P}_\mathcal{X}^T,\cdot)$ computed as defined in Eq.~\eqref{eq:dom_gen_kernel_least_squares}, which resides in the pre-defined space $\mathcal{H}_{k_\mathcal{X}}$ (e.g., in~\cite{blanchard2021domain} defined by a Gaussian kernel $k_\mathcal{X}$ with fixed bandwidth), see Remark~\ref{rem:generality}.
The space $\mathcal{H}_{k_\mathcal{X}}$ therefore needs to be a good choice for all domains, and such a choice might be hard to find in practice.
For example, the regression functions $f_{P^{(i)}},f_{P^{(j)}}$ of two domains $i,j\in\{1,\ldots,N\}$ can reside in two different RKHSs $\mathcal{H}_{k^{(i)}},\mathcal{H}_{k^{(j)}}$ with two well-known (or well learnable) kernels $k^{(i)},k^{(j)}$, but a priori guesses for aggregations of the two kernels might lead to unstable behavior of the regression in Eq.~\eqref{eq:dom_gen_kernel_least_squares}.
The second issue concerns the rate of the convergence in Eq.~\eqref{eq:domain_generalization_consistency}, which is unknown.
To the best of our knowledge, no domain generalization algorithm with a quantified rate for the convergence in Eq.~\eqref{eq:domain_generalization_consistency} is known.

This work presents a domain generalization algorithm $A$ as in Eq.~\eqref{eq:domain_generalization_algorithm} (i.e., mapping the source samples ${\bf z}^{(1)},\ldots,{\bf z}^{(N)}$ to a function $g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}:\widehat{P}_\mathcal{X}^T\mapsto (f:x\mapsto y)$)
that allows one to choose different RKHSs $\mathcal{H}_{k^{(i)}}$ with kernels $k^{(i)}$ for each domain $i\in\{1,\ldots,N\}$, and which has a quantified rate of the convergence in Eq.~\eqref{eq:domain_generalization_consistency} (i.e., of $\mathcal{E}^\infty(g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}})\to \inf_{g:\widehat{P}_\mathcal{X}^T\mapsto (f:x\mapsto y)} \mathcal{E}^\infty(g)$ for an increasing number of samples ${\bf z}^{(1)},\ldots,{\bf z}^{(N)}$ of increasing sizes).
    
\section{Summary of Results}


\subsection{Linear Operator Ansatz} \label{subsec:linear_ansatz}
Our approach follows the general Ansatz that there is a linear operator $G:L^2(\mathcal{X})\to L^2(\mathcal{X})$ mapping, for every $P\in\mathcal{M}_1^+(\mathcal{X}\times \mathcal{Y})$ drawn from $E$, the \textit{kernel mean embedding}
\footnote{
It holds that $m_{P_\mathcal{X}}\in \mathcal{H}_k\subseteq L^2(\mathcal{X})$, the space of square-integrable functions on $\mathcal{X}$. 
The mapping $m:P_\mathcal{X}\mapsto m_{P_\mathcal{X}}$ is well-defined if the kernel $k$ is bounded
and it is injective if $k$ is universal~\cite{gretton2006kernel,sriperumbudur2010hilbert}.}
\begin{align}
\label{eq:kernel_mean_embedding}
    m_{P_\mathcal{X}}(\cdot):=\int_\mathcal{X} k(\cdot,x')\diff P_\mathcal{X}(x')
\end{align}
to the domain-specific \textit{regression function}
\footnote{The regression function $f_P$ is well-defined since $\mathcal{X}$ and $\mathcal{Y}$ are Polish spaces (as compact subsets of $\mathbb{R}^{d}$ and $\mathbb{R}$, respectively) and, therefore, every $P\in\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$ can be factorized as $P(x,y)=P(y|x) P_\mathcal{X}(x)$ into a conditional probability measure $P(y|x)$ and a marginal (w.r.t.~$\mathcal{X}$) probability measure $P_\mathcal{X}(x)$, see~\cite[Theorem~10.2.1]{dudley2018real}.}
\begin{align}
\label{eq:regression_function_task_specific}
f_P(\cdot):=\int_\mathcal{Y} y\diff P(y|\cdot),
\end{align}
such that
\begin{align}
\label{eq:linear_operator_ansatz}
    f_{P}(\cdot)=G \cdot m_{P_\mathcal{X}}(\cdot) + \varepsilon(\cdot),
\end{align}
where $\varepsilon$ is some functional \textit{noise} that is drawn independently of $P_\mathcal{X}$ according to some probability measure $\mathcal{N}\in \mathcal{M}_1^+(L^2(\mathcal{X}))$, and that has zero mean $\int_{L^2(\mathcal{X})}\varepsilon \diff \mathcal{N}(\varepsilon)\equiv 0$ and finite variance $\sigma^2:=\int_{L^2(\mathcal{X})}\norm{\varepsilon}_{L^2(\mathcal{X})}^2 \diff \mathcal{N}(\varepsilon)<\infty$.

\begin{remark}
    The main conceptual contribution of this paper is the Ansatz above, which recasts domain generalization as a problem of functional regression.
    More precisely, our Ansatz allows domain generalization by learning the operator $G$, which maps
    (functional) mean embeddings to domain-specific regression functions.
    This makes it possible to estimate the regression functions in each domain with different kernels, and to apply explicit finite-sample bounds from the field of functional regression, e.g.~\cite{mollenhauer2022learning,jin2022minimax,tong2022non}.
    The independence of the noise is for simplicity and can be removed at the price of slightly more involved proofs, see~\cite[Eq.~(1)]{jin2022minimax}.
\end{remark}
We further assume that $G$ is an integral operator of the form
\begin{align}
    \label{eq:integral_form_of_G}
    G \cdot m_{P_\mathcal{X}}(\cdot) := a_0(\cdot)+ \int_\mathcal{X} m_{P_\mathcal{X}}(x) \beta(\cdot, x)\diff x
\end{align}
with \textit{intercept} $a_0:\mathcal{X}\to\mathcal{Y}$ and \textit{slope} $\beta:\mathcal{X}\times\mathcal{X}\to \mathbb{R}$ that need to be learned from the data. 
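For intuition, the following worked computation (a sketch, not part of the formal development) shows how the operator in Eq.~\eqref{eq:integral_form_of_G} acts on an empirical mean embedding.
It assumes an unlabeled sample ${\bf x}=(x_j)_{j=1}^n$ with empirical embedding $m_{\bf x}=\frac{1}{n}\sum_{j=1}^n k(\cdot,x_j)$ and uses only Eq.~\eqref{eq:kernel_mean_embedding} and the linearity of the integral:
\begin{align*}
    G\cdot m_{\bf x}(t)
    = a_0(t)+\int_\mathcal{X} \frac{1}{n}\sum_{j=1}^n k(x,x_j)\,\beta(t,x)\diff x
    = a_0(t)+\frac{1}{n}\sum_{j=1}^n \int_\mathcal{X} k(x,x_j)\,\beta(t,x)\diff x.
\end{align*}
In this form, a prediction at $t\in\mathcal{X}$ is an average, over the unlabeled sample points $x_j$, of the kernel-smoothed slope $x'\mapsto\int_\mathcal{X} k(x,x')\beta(t,x)\diff x$, so that no target labels are needed once $a_0$ and $\beta$ are available.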

\subsection{New Algorithm} \label{subsec:algorithm}

For simplicity, in the following, we assume equal sample sizes $n:=n_1=\ldots=n_N=n_T$.
Following our Ansatz in Eq.~\eqref{eq:linear_operator_ansatz} and Eq.~\eqref{eq:integral_form_of_G}, we propose the following two-step procedure:
\begin{enumerate}
    \item Regularized estimation of $f_{P^{(i)}}$ for every source sample ${\bf z}^{(i)}$ of domain $i\in\{1,\ldots,N\}$
    \begin{align}
        \label{eq:domain_specific_ridge_regression}
        f_{{\bf z}^{(i)}}^{\lambda_i}:=\argmin_{f\in\mathcal{H}_{k^{(i)}}} \sum_{j=1}^{n} (f(x_j^{(i)})-y_j^{(i)})^2 + \lambda_i \norm{f}^2_{\mathcal{H}_{k^{(i)}}}.
    \end{align}
   
   
   
   
   
    \item Regularized estimation of slope $\beta$ on
    functional data $(m_{{\bf x}^{(i)}},f_{{\bf z}^{(i)}}^{\lambda_i})_{i=1}^N$ with
   
    $m_{{\bf x}^{(i)}}:=m_{\widehat{P}^{(i)}_\mathcal{X}}$
\end{enumerate}
\vspace{-12pt}
\begin{align}
        \label{eq:functional_regression_of_slope}
        \beta_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}:=
        \argmin_{\beta(x,\cdot)\in \mathcal{H}_k}
        \frac{1}{N}\sum_{i=1}^N
        \norm{f_{{\bf z}^{(i)}}^{\lambda_i}-\int \beta(\cdot, x')m_{{\bf x}^{(i)}}(x')\diff x' }_{L^2(\mathcal{X})}^2+\lambda \int_\mathcal{X} \norm{\beta(x,\cdot)}_{\mathcal{H}_k}^2 \diff x.
    \end{align}
 \noindent
We define the final model $g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}:\mathcal{M}_1^+(\mathcal{X})\to\{f:\mathcal{X}\to\mathcal{Y}\}$ as required in Eq.~\eqref{eq:domain_generalization_function_g} by $g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}({P}_\mathcal{X})(x):=G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda} m_{P_\mathcal{X}}(x)$, where $G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}$ is the integral operator defined in Eq.~\eqref{eq:integral_form_of_G} with the slope $\beta_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}$.
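To make the inputs of the second step concrete, the following standard closed forms may help; they are stated as a sketch, and the Gram matrix $K^{(i)}$ and the coefficient vector $\alpha^{(i)}$ are notation introduced only for this illustration.
By the representer theorem, see e.g.~\cite[Theorem~1]{scholkopf2001generalized}, the minimizer of Eq.~\eqref{eq:domain_specific_ridge_regression} can be written as
\begin{align*}
    f_{{\bf z}^{(i)}}^{\lambda_i}=\sum_{j=1}^n \alpha^{(i)}_j\, k^{(i)}(x_j^{(i)},\cdot),
    \qquad
    \alpha^{(i)}=\left(K^{(i)}+\lambda_i I_n\right)^{-1}{\bf y}^{(i)},
    \qquad
    K^{(i)}_{jl}:=k^{(i)}(x_j^{(i)},x_l^{(i)}),
\end{align*}
where the factor $\lambda_i$ (rather than $n\lambda_i$) reflects the unnormalized sum in Eq.~\eqref{eq:domain_specific_ridge_regression}.
Similarly, the empirical mean embedding is the finite kernel expansion
\begin{align*}
    m_{{\bf x}^{(i)}}=\frac{1}{n}\sum_{j=1}^n k(\cdot,x_j^{(i)}).
\end{align*}
Both members of each functional data pair $(m_{{\bf x}^{(i)}},f_{{\bf z}^{(i)}}^{\lambda_i})$ are therefore determined by finitely many coefficients, and each domain $i$ may use its own kernel $k^{(i)}$ in the first expansion.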

\begin{remark} \label{rem:generality}
    Note that the final predictor $g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}(\widehat{P}^T_\mathcal{X})(\cdot)$ defined above is not forced to reside in an RKHS $\mathcal{H}_{k^{(i)}}$ defined by one of the pre-defined kernels $k^{(1)},\ldots,k^{(N)}$ but is allowed to take more general forms.
    In this way, our algorithm can be interpreted as an extension of the marginal transfer approach, which computes a predictor $g(\widehat{P}_\mathcal{X}^T)(\cdot)=f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda(N)}(\widehat{P}_\mathcal{X}^T,\cdot)$ by Eq.~\eqref{eq:dom_gen_kernel_least_squares}.
    According to the representer theorem, see e.g.~\cite[Theorem~1]{scholkopf2001generalized}, this predictor $f_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda(N)}(\widehat{P}_\mathcal{X}^T,\cdot)$ admits the representation
    \begin{align*}
        \sum_{i=1}^N \sum_{j=1}^{n_i} \alpha_{i,j} \cdot\overline{k}\left((\widehat{P}_\mathcal{X}^{(i)},x_j^{(i)}),(\widehat{P}_\mathcal{X}^T,\cdot)\right)
        = \sum_{i=1}^N \sum_{j=1}^{n_i} \underbrace{\alpha_{i,j}\cdot k_{\mathcal{M}_1^+(\mathcal{X})}(\widehat{P}_\mathcal{X}^{(i)}, \widehat{P}_\mathcal{X}^T)}_{\text{$=:\widetilde{\alpha}_{i,j}\in\mathbb{R}$}} \cdot k_\mathcal{X}(x_j^{(i)},\cdot)
    \end{align*}
    which resides, as a linear combination of kernel sections $k_\mathcal{X}(x_j^{(i)},\cdot)$, in the pre-defined RKHS $\mathcal{H}_{k_\mathcal{X}}$ (as, e.g., a Gaussian RKHS in~\cite{blanchard2021domain}).
\end{remark}

\subsection{Finite Sample Error Bound} \label{subsec:error_rates_o_notation}

For our algorithm defined in Subsection~\ref{subsec:algorithm}, we provide finite sample bounds under assumptions that fall into four categories: classical regularity conditions on the involved kernels (mean embeddings, domain-specific regression, operator slope); classical assumptions on the effective dimension of the involved integral operators (domain-specific regression, operator slope learning); new assumptions on the (functional) data generating process; and new assumptions relating the domain-specific regression problems to the global functional regression.

Under these assumptions, it holds with probability at least $1-\delta$ that
\begin{align} \label{eq:error_rates_o_notation}
\mathcal{E}^\infty(g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda})- \inf_{g:\mathcal{M}_1^+(\mathcal{X})\to\left\{f:\mathcal{X}\to\mathbb{R}\right\}}\mathcal{E}^{\infty}(g) \leq \frac{\log \frac{4}{\delta}}{\delta^2} \mathcal{O}(N^{-\frac{1}{1+c_6}})\left(\mathcal{O}({n^{-\frac{1}{1+c_3}}})+\mathcal{O}(1)\right)
\end{align}
for some $0 < c_3, c_6 \le 1$ independent of $P\in\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$ and regularization parameter choices
$$\lambda=N^{-\frac{1}{1+c_6}}, \lambda_1=\cdots=\lambda_N=n^{-\frac{1}{1+c_3}}.$$

\begin{remark}
    Our finite sample bound in Eq.~\eqref{eq:error_rates_o_notation} accounts for properties of the RKHS $\mathcal{H}_k$ in which the slope $\beta\in\mathcal{H}_k$ is assumed to reside, see (A5) in Subsection~\ref{subsec:assumptions} below.
    However, it does not take into account the smoothness of $\beta$, which can be done by combining it with the methods from~\cite{mollenhauer2022learning}.
\end{remark}

\section{Finite Sample Error Bound}

In this section, we detail our assumptions and the strategy for proving Eq.~\eqref{eq:error_rates_o_notation}.
All proofs can be found in Appendix~\ref{app:proofs}.
We start by introducing some further notation in Subsection~\ref{subsec:notation}.
Then, in Subsection~\ref{subsec:assumptions}, we summarize all assumptions, split into classical assumptions and some new assumptions.
In Subsection~\ref{subsec:preliminary_statements}, we prove some preliminary statements, which we use in Subsection~\ref{subsec:convergence_rates_result} to prove Eq.~\eqref{eq:error_rates_o_notation}.


\subsection{Notation}
\label{subsec:notation}

For an $n$-sized sample ${\bf z}:=(x_j,y_j)_{j=1}^n$ independently drawn according to some $P\in\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$, we denote by $P^n({\bf z}):=\bigotimes_{j=1}^n P(x_j,y_j)$ the corresponding product measure.
We further denote by $\text{Supp}(E)\subseteq \mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$ the support of $E$.
For a bounded, compact and self-adjoint operator $K$ on $L^2(P)$ with eigenvalues $(\theta_j)_{j=1}^{\infty}$, we denote the \textit{effective dimension} by
$$
\gamma_K(\lambda):=\text{Tr}((K+\lambda I)^{-1}K)=\sum_{j=1}^\infty \frac{\theta_j}{\lambda+\theta_j},
$$
see~\cite{caponnetto2007optimal}.
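As an illustration of how the polynomial bounds on the effective dimension required in Subsection~\ref{subsec:assumptions} can arise, consider the hypothetical case of polynomially decaying eigenvalues $\theta_j=j^{-b}$ with $b>1$ (an assumption made only for this example).
Since $t\mapsto (1+\lambda t^{b})^{-1}$ is decreasing, the sum is bounded by the corresponding integral, and the substitution $u=\lambda^{1/b}t$ gives
\begin{align*}
    \gamma_K(\lambda)=\sum_{j=1}^\infty \frac{1}{1+\lambda j^{b}}
    \leq \int_0^\infty \frac{\diff t}{1+\lambda t^{b}}
    = \lambda^{-\frac{1}{b}}\int_0^\infty \frac{\diff u}{1+u^{b}}.
\end{align*}
The remaining integral is finite for $b>1$, so in this example $\gamma_K(\lambda)\leq c\,\lambda^{-s}$ with $s=\frac{1}{b}\in(0,1)$; faster eigenvalue decay thus corresponds to a smaller exponent.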
Let us also denote the \textit{covariance kernel} related to the sampling process of the empirical mean embeddings by
\begin{align*}
C(s,t)=\int_{\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})} \int_{\mathcal{X}^n} m_{{\bf x}}(s)m_{{\bf x}}(t) \diff P_\mathcal{X}^n({\bf x}) \diff E(P),
\end{align*}
and its associated integral operator by
$(G_C f)(\cdot):=\int_{\mathcal{X}} C(x,\cdot) f(x) \diff x$.
We also denote the operator $T_{k}:=G_{k}^{\frac12} G_C G_{k}^{\frac12}$ for $G_k f:=\int_\mathcal{X} k(\cdot,x) f(x)\diff x$. Functions of self-adjoint operators (e.g., the square root) are defined via the spectral calculus.
The function $f_0 \in L^2(\mathcal{X}^2)$ is defined such that $\beta(t,\cdot)=G_k^{\frac12}f_0(t,\cdot)$ for $\beta\in\mathcal{H}_k$ as in Eq.~\eqref{eq:functional_regression_of_slope}; it is well defined if $k$ is universal.








\subsection{Assumptions} \label{subsec:assumptions}

Our assumptions can be grouped in two parts: The first part deals with assumptions that are used in related works.
We formulate them in a way such that they hold uniformly over all $P \in \text{Supp}(E)$.
The second part discusses assumptions specific for our setting.
 All enumerated constants are independent of $P\in \text{Supp}(E)$.
 \paragraph{Assumptions from Related Works}
\begin{enumerate}[label=(A{{\arabic*}})]
\itemsep0em 
\item \textit{Assumption on kernels:} All applied kernels $k:\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ belong to a family $\mathcal{K}$ of continuous (on the compact $\mathcal{X}$) kernels and admit a uniform bound $\kappa^2:=\sup_{k \in \mathcal{K}} \sup_{x\in\mathcal{X}} |k(x,x)|$.
   
    \item \textit{Regularity conditions for domain-specific regression:} For every $P\in \text{Supp}(E)$ the corresponding regression function $f_P$ satisfies $f_{P}=G^{\frac12}_{k,P} g_P$ for some $k \in \mathcal{K}$, $g_P \in L^2(P)$ and $G_{k,P}(f) :=\int_\mathcal{X} f(x) k(\cdot, x)\diff P(x)$.
    Moreover, $\norm{g_P}_{L^2(P)} \le c_1$ for some $c_1>0$.
    \item \textit{Assumptions on effective dimensions for domain-specific regression:} There are $c_2>0,0<c_3\le 1$ such that for any $P \in \text{Supp}(E)$ and $\lambda>0$ and $k \in \mathcal{K}$, the effective dimension of $G_{k,P}$ satisfies $\gamma_{G_{k,P}}(\lambda) \le c_2 \lambda^{-c_3}$.
    \item \textit{Assumptions for functional regression:}
   
    The slope $\beta$ of $G$ in Eq.~\eqref{eq:integral_form_of_G} satisfies $\beta(x,\cdot) \in \mathcal{H}_{k}$ with a universal kernel $k\in\mathcal{K}$ (which ensures $G_k (L^2(\mathcal{X}))=\mathcal{H}_k$) and admits a bound $\int_\mathcal{X} \norm{\beta(x,\cdot)}_{\mathcal{H}_{k}}^2 \diff x \le c_4$ for some $c_4>0$.
    \item \textit{Assumptions on effective dimensions for functional regression:} For $T_k$ as defined in Subsection~\ref{subsec:notation}, it holds that $\gamma_{T_k}(\lambda)$ satisfies $\gamma_{T_k}(\lambda) \le c_5 \lambda^{-c_6}$  for some $c_5>0,0<c_6\le 1$.
    \item \textit{Zero intercept:} It holds that $a_0\equiv 0$.
   
    \end{enumerate}

    \paragraph{Our Assumptions}
    \begin{enumerate}
    [label=(B{{\arabic*}})]
    \itemsep0em
    \item \textit{Relation between distributions:} There are $c_{*} ,c^{*}>0$ such that for all $P\in \text{Supp}(E)$, we have that $\frac{1}{c_{*}} \norm{f}_{L^2(P)} \le  \norm{f}_{L^2(\mathcal{X})} \le c^{*} \norm{f}_{L^2(P)}$ for all $f \in L^2(P)$.
   \item \textit{Coercivity of operator $G_C$:} There exists $c_7>0$ such that $\norm{g}^2_{L^2(\mathcal{X}^2)}\leq c_7 \langle G_C g, g\rangle_{L^2(\mathcal{X}^2)}$ for all $g\in L^2(\mathcal{X}^2)$ with $G_C$ as defined in Subsection~\ref{subsec:notation}.
    \item \textit{Independence of estimation errors:} For any ${\bf z}=({\bf x},{\bf y})$ drawn from $P$ (itself drawn from $E$), the distributions of the estimation errors $(f_{{\bf z}}^{\lambda}-f_{P})$ and $G(m_{{\bf x}}-m_{P})$ are independent of the distribution of $m_{{\bf x}}$.
    \item \textit{Estimation errors are unbiased:} The estimation biases satisfy
    \begin{align}
    \label{eq:estimation_bias_zero_assumption}
        &\int_{\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})} \int_{\mathcal{X}^n} G\cdot (m_{{\bf x}}-m_{P_\mathcal{X}}) \diff P_\mathcal{X}^n({\bf x}) \diff E(P)\equiv 0\\
        &\int_{\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})} \int_{(\mathcal{X}\times\mathcal{Y})^n} (f_{{\bf z}}^{\lambda}-f_{P}) \diff P^n({\bf z}) \diff E(P)\equiv 0.
    \end{align}
\end{enumerate}

\begin{remark}    
   
   
    Assumptions (B1) and (B2) essentially relate differences in the $L^2$-spaces caused by drawing different distributions.
    Assumption (B3) allows easy separation of noise expectation from data expectations. (B3) can be relaxed using techniques from~\cite{jin2022minimax}, where the data generating model in Eq.~\eqref{eq:linear_operator_ansatz} is assumed without the independence assumption.
    Assumption (B4) can also be slightly relaxed by assuming zero bias conditioned on the draw of $P$~\cite{mollenhauer2022learning}.
    Assumptions (B1)--(B4) are not meant to provide an exhaustive theoretical setup; rather, they aim to lay the groundwork for new algorithms and analyses of domain generalization by functional regression.
\end{remark}

\subsection{Preliminaries}
\label{subsec:preliminary_statements}

Our finite sample bound relies on bounds from~\cite{tong2022non} for functional regression, which require bounding the variances of the errors caused by the finite sample approximation of the mean embeddings $m_P$ and the regression functions $f_P$.
This section collects the corresponding auxiliary statements that prepare the ground for our finite sample bound.
The proofs are deferred to Appendix~\ref{app:proofs}.

\begin{lemma} \label{lemma:uniform bound_mean_embeddings}
    The kernel mean embedding $m_{P'}$ of any $P'\in\mathcal{M}_1^+(\mathcal{X})$ w.r.t.~a kernel $k\in\mathcal{K}$ is bounded in $L^2(P)$-norm for any $P\in\mathcal{M}_1^+(\mathcal{X})$ by
    \begin{align}
       &\norm{m_{P'}}_{L^2(P)}\leq \kappa^4. \label{eq:uniform bound mean embeddings} 
    \end{align}
\end{lemma} 
\noindent
The next Lemma~\ref{lemma:variance_bound_for_ridge_regression} follows from~\cite[Theorem~2]{guo2017learning}.

\begin{lemma}
    \label{lemma:variance_bound_for_ridge_regression}
    Let $P\in\text{Supp}(E)$, $n \in \mathbb{N}$ and assume (A2)--(A3) and (B1).
    Then, for $\lambda=n^{-\frac{1}{c_3+1} }$, we have that
    \begin{align}
    \label{eq:variance_bound_for_ridge_regression}
         \int_{(\mathcal{X}\times\mathcal{Y})^{n}} \norm{f_{{\bf z}}^{\lambda}-f_{P}}_{L^2(\mathcal{X})}^2 \diff P^n({\bf z}) \leq (c^{*})^2 c_8 n^{-\frac{1}{c_3+1} },
    \end{align}
    for some $c_8>0$ that is independent of $P$.
\end{lemma}

\begin{lemma}[\cite{wolfer2022variance}, Section 2, Remark 2.1]
    \label{lemma:estimation_bound_mean_embeddings}
    For $P\in\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$, $n \in \mathbb{N}$ and $k \in \mathcal{K}$, we have that
    \begin{align}
    \label{eq:estimation_bound_mean_embeddings}
        \int_{\mathcal{X}^n} \norm{m_{{\bf x}}-m_{P_\mathcal{X}}}_{\mathcal{H}_{k}}^2 \diff P_\mathcal{X}^n({\bf x}) \leq \frac{\kappa^2}{n}.
    \end{align}
\end{lemma}
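
\noindent
The bound in Lemma~\ref{lemma:estimation_bound_mean_embeddings} can be checked by a short standard computation, sketched here for independent $x_1,\ldots,x_n$ drawn from $P_\mathcal{X}$, with the shorthand $\phi(x):=k(\cdot,x)$ used only in this sketch.
Since $m_{{\bf x}}-m_{P_\mathcal{X}}=\frac{1}{n}\sum_{j=1}^n \left(\phi(x_j)-m_{P_\mathcal{X}}\right)$ is an average of independent, centered $\mathcal{H}_k$-valued terms, all cross terms vanish in expectation and
\begin{align*}
    \int_{\mathcal{X}^n} \norm{m_{{\bf x}}-m_{P_\mathcal{X}}}_{\mathcal{H}_{k}}^2 \diff P_\mathcal{X}^n({\bf x})
    &= \frac{1}{n}\int_\mathcal{X} \norm{\phi(x)-m_{P_\mathcal{X}}}_{\mathcal{H}_k}^2 \diff P_\mathcal{X}(x)\\
    &= \frac{1}{n}\left(\int_\mathcal{X} k(x,x)\diff P_\mathcal{X}(x)-\norm{m_{P_\mathcal{X}}}_{\mathcal{H}_k}^2\right)
    \leq \frac{\kappa^2}{n},
\end{align*}
where the last step uses the uniform bound $k(x,x)\leq\kappa^2$ from (A1).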

\noindent
Lemma~\ref{lemma:estimation_bound_mean_embeddings} leads to the following variance bound.

\begin{lemma}
    \label{lemma:variance_mapped_mean_embedding}
  For $P\in\mathcal{M}_1^+(\mathcal{X}\times\mathcal{Y})$, $n \in \mathbb{N}$ and $k \in \mathcal{K}$, we have that
    \begin{align}
        \label{eq:variance_mapped_mean_embedding}
        \int_{\mathcal{X}^n}
        \norm{G\cdot(m_{{\bf x}}-m_{P_\mathcal{X}})}_{L^2(\mathcal{X})}^2
        \diff P_\mathcal{X}^n({\bf x}) \leq \frac{c_4 \kappa^6}{n}.
    \end{align}
\end{lemma}

\subsection{Convergence Rates Result}
\label{subsec:convergence_rates_result}

We now turn to the proof of Eq.~\eqref{eq:error_rates_o_notation}.
We need to analyze the difference 
\begin{align}\label{eq:error_decomp_1}
\mathcal{E}^\infty(g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}) &- \inf_{g:\mathcal{M}_1^+(\mathcal{X})\to\left\{f:\mathcal{X}\to\mathbb{R}\right\}}\mathcal{E}^{\infty}(g)\nonumber
\\= &\int_{\mathcal{M}_1^+(\mathcal{X})} \int_{\mathcal{X}\times\mathcal{Y}} \left(G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda} m_{P_\mathcal{X}}(x)-y\right)^2\diff P(x,y) \diff E(P)\nonumber
\\ &\quad\quad\quad-\int_{\mathcal{M}_1^+(\mathcal{X})} \int_{\mathcal{X}\times\mathcal{Y}} \left(f_{P}(x)-y\right)^2\diff P(x,y) \diff E(P) \nonumber \\
= &\int_{\mathcal{M}_1^+(\mathcal{X})}  \norm{G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda} m_{P_\mathcal{X}}-f_{P}}_{L^2(P)}^2 \diff E(P)\nonumber \\
=&\int_{\mathcal{M}_1^+(\mathcal{X})} \norm{G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda} m_{P_\mathcal{X}}-G m_{P_\mathcal{X}}}_{L^2(P)}^2 \diff E(P),
\end{align}
where the second equality follows from the bias-variance decomposition (see, e.g., \citet[Proposition~1]{cucker2002mathematical}) and the last equality from our linear operator Ansatz in Subsection~\ref{subsec:linear_ansatz}.
In order to further analyze Eq.~\eqref{eq:error_decomp_1}, we use assumption (B1) from Subsection~\ref{subsec:assumptions} and bound its integrand from above by
\begin{align} \label{eq:error_decomp_2}
(c_{*})^2 \norm{G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}-G}^2_{\text{Op}(L^2(\mathcal{X}))}
  \norm{m_{P_\mathcal{X}}}_{L^2(\mathcal{X})}^2,
\end{align}
where $\norm{\cdot}_{\text{Op}(L^2(\mathcal{X}))}$ denotes the operator norm on $L^2(\mathcal{X})$.
As $\norm{m_{P_\mathcal{X}}}_{L^2(\mathcal{X})}$ can be uniformly bounded (using Lemma~\ref{lemma:uniform bound_mean_embeddings}), it remains to bound $\norm{G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}-G}^2_{\text{Op}(L^2(\mathcal{X}))}$. This operator norm is bounded by the corresponding Hilbert-Schmidt norm, which relates to the $L^2(\mathcal{X}^2)$-norm of the difference between the corresponding slope functions; this is used by the following key lemma, together with assumption (B2) and methods from~\cite{tong2022non}.







\begin{lemma}
\label{lemma:operator_norm}
Consider the algorithm introduced in Subsection~\ref{subsec:algorithm}.
Under the assumptions stated in Subsection~\ref{subsec:assumptions}, if we set $\lambda=N^{-\frac{1}{1+c_{6}}}$ and $\lambda_i=n^{-\frac{1}{1+c_{3}}}$ for $i=1,\ldots,N$, we have that for any $0<\delta<1$ with probability at least $1-\delta$:
\begin{align*}
\norm{G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}-G}^2_{\mathrm{Op}(L^2(\mathcal{X}))} \le c_7 C(\Bar{\sigma}^2)\frac{\log \frac{4}{\delta}}{\delta^2} N^{-\frac{1}{1+c_6}},
\end{align*}
where 
 \begin{align*}
 C(\Bar{\sigma}^2)&=2 \left(\frac{\Bar{\sigma}^2\left(2 \kappa^5   \left(\kappa^5  +\sqrt{c_5}\right)+1\right)^6}{\kappa^{10} }+\left(2 \kappa^5  \left(\kappa^5 +\sqrt{c_5}\right)+1\right)^2 \norm{f_0}_{L^2(\mathcal{X}^2)}^2\right),
 \end{align*}
 and
\begin{align} \label{eq:bound_variance_total}
\Bar{\sigma}^2 = (c^{*})^2 c_8 n^{-\frac{1}{c_3+1} }+ c_4 \kappa^6 n^{-1} +\sigma^2.
\end{align}
\end{lemma}
Applying Lemma~\ref{lemma:operator_norm} to Eq.~\eqref{eq:error_decomp_2} and combining it with the variance bounds in Lemma~\ref{lemma:variance_bound_for_ridge_regression} and Lemma~\ref{lemma:variance_mapped_mean_embedding} yields our main finite sample bound.

\begin{theorem}
Consider the algorithm introduced in Subsection~\ref{subsec:algorithm}.
Under the assumptions stated in Subsection~\ref{subsec:assumptions}, if we set $\lambda=N^{-\frac{1}{1+c_{6}}}$ and $\lambda_i=n^{-\frac{1}{1+c_{3}}}$ for $i=1,\ldots,N$, we have that for any $0<\delta<1$ with probability at least $1-\delta$:
\begin{align} \label{eq:main_bound}
\mathcal{E}^\infty(g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda})- \inf_{g:\mathcal{M}_1^+(\mathcal{X})\to\left\{f:\mathcal{X}\to\mathbb{R}\right\}}\mathcal{E}^{\infty}(g) \le C'(n) \frac{\log \frac{4}{\delta}}{\delta^2} N^{-\frac{1}{1+c_6}},
\end{align}
where
\begin{align}
\label{eq:main_constant}
C'(n)=c_{*}^2 c_7 \kappa^8 C(\Bar{\sigma}^2)=~&\frac{2\kappa^8 c_{*}^2 c_7 ((c^{*})^2 c_8 n^{-\frac{1}{c_3+1} }+ c_4 \kappa^6 n^{-1} +\sigma^2 )(2 \kappa^5   (\kappa^5  +\sqrt{c_5})+1)^6}{\kappa^{10} }\nonumber\\
&+2\kappa^8 c_{*}^2 c_7 (2 \kappa^5  (\kappa^5 +\sqrt{c_5})+1)^2 \norm{f_0}_{L^2(\mathcal{X}^2)}^2.
\end{align}
\end{theorem}
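
\noindent
The constant in Eq.~\eqref{eq:main_constant} can be traced back as follows. This is only a sketch of the bookkeeping, assuming that Lemma~\ref{lemma:uniform bound_mean_embeddings} is applied in the form $\norm{m_{P_\mathcal{X}}}_{L^2(\mathcal{X})}^2\leq(\kappa^4)^2=\kappa^8$ and that $1+c_6>0$; the accuracy level $\varepsilon>0$ below is a symbol introduced only for this illustration.
Bounding the integrand of Eq.~\eqref{eq:error_decomp_1} by Eq.~\eqref{eq:error_decomp_2}, integrating w.r.t.~the probability measure $E$, and inserting Lemma~\ref{lemma:operator_norm} gives
\begin{align*}
\mathcal{E}^\infty(g_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda})- \inf_{g:\mathcal{M}_1^+(\mathcal{X})\to\left\{f:\mathcal{X}\to\mathbb{R}\right\}}\mathcal{E}^{\infty}(g)
&\leq c_{*}^2 \kappa^8 \norm{G_{{\bf z}^{(1)},\ldots,{\bf z}^{(N)}}^{\lambda_1,\ldots,\lambda_N,\lambda}-G}^2_{\mathrm{Op}(L^2(\mathcal{X}))}\\
&\leq c_{*}^2 \kappa^8 c_7 C(\Bar{\sigma}^2)\frac{\log \frac{4}{\delta}}{\delta^2} N^{-\frac{1}{1+c_6}},
\end{align*}
which matches $C'(n)=c_{*}^2 c_7 \kappa^8 C(\Bar{\sigma}^2)$.
Read as a sample complexity statement, the right-hand side of Eq.~\eqref{eq:main_bound} is at most $\varepsilon$ as soon as
\begin{align*}
N\geq\left(\frac{C'(n)\log \frac{4}{\delta}}{\delta^2\,\varepsilon}\right)^{1+c_6}.
\end{align*}
The per-domain sample size $n$ enters only through $\Bar{\sigma}^2$ in Eq.~\eqref{eq:bound_variance_total}, so increasing $n$ affects the constant $C'(n)$ but not the rate in $N$.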


\section{Numerical Example}

The goal of our numerical example is (a) to underpin the potential of the proposed functional regression approach for domain generalization and (b) to illustrate an implementation of the algorithm in Subsection~\ref{subsec:algorithm}, which we also provide in Python.

\paragraph{Data Generation}
We generated $N=100$ input source samples $({\bf x}^{(i)})_{i=1}^N\in (\mathcal{X})^{n\times N}$, $\mathcal{X}=[0,1]$, each of size $n=100$, drawn independently from $N$ different truncated Normal distributions $(P_\mathcal{X}^{(i)})_{i=1}^N$, see Eq.~\eqref{eq:source_samples}.
The truncated Normal distributions are supported on the compact interval $[\mu^{(i)}-0.3, \mu^{(i)}+0.3]$, with means $\mu^{(i)}:=\int_\mathcal{X} x\diff P_\mathcal{X}^{(i)}(x)$ drawn independently and uniformly from the interval $[0.3,0.7]$ and variances generated in the interval $[0.025, 0.125]$, see Figure~\ref{fig:numerical_illustration}(a).
The outputs $({\bf y}^{(i)})_{i=1}^N\in (\mathbb{R})^{n\times N}$, $\mathcal{Y}=\mathbb{R}$, corresponding to the inputs $({\bf x}^{(i)})_{i=1}^N\in \mathbb{R}^{n\times N}$, are generated according to the equation
\begin{align*}
y_j^{(i)} = \frac{1}{10} \sin\!\left(\frac{3 x_j^{(i)}}{(\mu^{(i)})^2}\right) + \frac{9}{10} - \left(1.7 \left(x_j^{(i)}-\frac{1}{2}\right)\right)^2 + \epsilon_j^{(i)},
\end{align*}
where the noise terms $\epsilon_j^{(i)}$, $i\in\{1,\ldots,N\}$, $j\in\{1,\ldots,n\}$, are independently drawn from a Normal distribution with mean $0$ and variance $0.02$, see Figure~\ref{fig:numerical_illustration}(b).

\paragraph{State-of-the-art Baselines}
Recall that the goal of domain generalization is to learn, from the source samples $({\bf x}^{(i)},{\bf y}^{(i)})_{i=1}^N$, a model $g:\mathcal{M}_1^+(\mathcal{X})\to\{f:\mathcal{X}\to\mathcal{Y}\}$ that performs well on data from new \textit{target} distributions.
In our example, this means that $g$ needs to be computed from the data $\left({\bf x}^{(i)},{\bf y}^{(i)}\right)_{i=1}^N$ described above.
Figure~\ref{fig:numerical_illustration}(c) shows as a dashed line the prediction of a single ridge regression (Eq.~\eqref{eq:domain_specific_ridge_regression}) on the \textit{pooled} data $\left(x_j^{(i)},y_j^{(i)}\right)_{i\in\{1,\ldots,N\},j\in\{1,\ldots,n\}}$, which serves as the baseline representing the state of the art.
We also implemented the approach in~\cite{blanchard2021domain}, but it was not able to outperform the pooling procedure, although an intensive parameter search was performed as follows.

\sloppy
Following~\cite{blanchard2021domain}, the parameter $\lambda$ and the kernel of the RKHS of the ridge regression for pooling were chosen by $5$-fold cross-validation on a grid of values, $\lambda\in\{10^{-1},10^{-2},\ldots,10^{-6}\}$ and the kernel either as Gaussian kernel $k(x,y)=e^{-\frac{|x-y|^2}{2 l^2}}$ with $l\in\{1,5,10\}$ or as periodic kernel $k(x,y)=e^{-\frac{2 \sin^2(\pi |x-y|/p)}{l^2}}$ with $l\in\{1,10^{-1},10^{-2}\}, p=1$.
We followed the same procedure for choosing the parameters and the kernels used in our implementation of the marginal transfer approach of~\cite{blanchard2021domain} in Subsection~\ref{subsec:marginal_transfer_learning}.
In particular, we chose the kernel for computing the empirical kernel mean embeddings as the Gaussian kernel with $l\in\{10^{-2},10^{-3}\}$ or the periodic kernel with $l=1, p=1$; we chose the kernel $k_\mathcal{X}$ either as the Gaussian kernel with $l\in\{10^{3},10^2,\ldots,10^{-3}\}$ or the periodic kernel with $l\in\{1,10^{-1},10^{-2}\}, p=1$; and we chose the kernel between mean embeddings as the Gaussian kernel with $l\in\{1,10^3,10^{-3}\}$.
The regularization parameter was chosen as $\lambda\in\{10^{3},10^{2},\ldots,10^{-4}\}$.

 \begin{figure}[t]
 \centering
 \includegraphics[width=0.8\textwidth]{img/figure.jpg}
 \caption{Our approach maps kernel mean embeddings of input distributions (a) to regression functions (b, dashed) and allows us to outperform ridge regression on pooled data (c, dashed), in contrast to~\cite{blanchard2021domain} (also c, dashed), which is illustrated by four random test predictions of our approach (d, dashed).}
 \label{fig:numerical_illustration}
\end{figure}

\paragraph{Implementation of Functional Regression Approach}
The implementation of our general functional regression algorithm described in Subsection~\ref{subsec:algorithm} has two main steps.

The first step is ridge regression on each source distribution $P^{(i)}$ to compute an estimator $f_{{\bf z}^{(i)}}^{\lambda_i}$.
In this step, the full potential of our approach can be seen, as for each $i\in\{1,\ldots,N\}$, a different RKHS can be learned using cross-validation with $\lambda_i\in\{10^{-1},\ldots,10^{-4}\}$ and kernels being either a Gaussian kernel with $l\in\{10^{-1},\ldots,10^{-4}\}$ or a periodic kernel with $l\in\{1,10^{-1},10^{-2}\}$ and $p\in\{1,2,3,5,10\}$.

The second step is penalized estimation according to Eq.~\eqref{eq:functional_regression_of_slope}. In this step, we follow Algorithm~1 in Section~4 of~\cite{tong2022non} step by step.
This algorithm is essentially ridge regression, but it requires estimating functions instead of scalar weights for the kernel sections in the solution granted by the representer theorem, see, e.g.,~\cite[Eq.~(16)]{scholkopf2001generalized}.
One particular difficulty which has to be mentioned at this point is the estimation of the $L^2([0,1])$-norm, which is required in this step.
This is done in~\cite{tong2022non} by discretization of the interval $[0,1]$, which is simple in our numerical example using $1000$ equally spaced grid points, but it suffers from the curse of dimensionality when the dimension of the input data increases.
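
\noindent
As an illustration of this discretization, one possible quadrature is the midpoint rule; the specific rule and the symbols $M$, $t_j$ and $d$ are assumptions made for this sketch and need not coincide with the choice in~\cite{tong2022non}. With $M=1000$ grid points $t_j=\frac{2j-1}{2M}$, $j=1,\ldots,M$, the squared norm of $f\in L^2([0,1])$ is approximated by
\begin{align*}
\norm{f}_{L^2([0,1])}^2=\int_0^1 f(t)^2\diff t\approx\frac{1}{M}\sum_{j=1}^{M} f(t_j)^2 .
\end{align*}
On $[0,1]^d$, the analogous tensor grid has $M^d$ points, e.g., $10^6$ points for $d=2$ and $10^9$ points for $d=3$, which is the source of the curse of dimensionality mentioned above.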

\paragraph{Result}
Figure~\ref{fig:numerical_illustration}(d) shows some predictions (dashed lines) of our algorithm on new target distributions (not included in the source distributions).
Although the predictions are not perfect, they clearly outperform the dashed line of the simple pooling approach (and also the algorithm of~\cite{blanchard2021domain}).
We also measured the difference in empirical least-squares test error.
On average (over $20$ new test distributions),
the dashed regression functions illustrated in Figure~\ref{fig:numerical_illustration}(b) (lower bound) achieve an error of $0.0008$, the gray parabola in Figure~\ref{fig:numerical_illustration}(c) (upper bound) has an error of $0.0062$, the simple pooling approach and marginal transfer learning both achieve an error of $0.0042$, and our implementation achieves an error of $0.0029$.
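
\noindent
To put these averages into perspective, the following arithmetic uses only the error values reported above and should be read as a rough comparison, not as a statistical statement:
\begin{align*}
\frac{0.0042-0.0029}{0.0042}\approx 0.31,
\qquad
0.0029-0.0008=0.0021 .
\end{align*}
That is, our implementation reduces the average test error of the pooling baseline by roughly $31\%$, while a gap of $0.0021$ to the lower bound remains.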




\section{Conclusion and Future Work}

In this work, we study domain generalization as a problem of functional regression, i.e., regression with functional input and output, by directly learning the relationship between domain-specific marginal distributions of inputs and corresponding conditional distributions of outputs given inputs.
Our new conceptualization leads to an operator learning algorithm with finite sample bounds.
We also provide numerical illustrations showing its advantage of explicit computation of domain-specific predictors in possibly different reproducing kernel Hilbert spaces.

Our work aims at laying the groundwork for (to the best of our knowledge, the first) finite sample bounds for domain generalization.
However, it leaves an exhaustive statistical analysis for future research, e.g., by taking the smoothness of the operator slope and more general non-linear functional regression into account.


\acks{We acknowledge the ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning at the University of Linz. In addition, the research reported in this paper has been partly funded by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK), the Federal Ministry for Digital and Economic Affairs (BMDW), and the Province of Upper Austria in the frame of the COMET--Competence Centers for Excellent Technologies Programme and the COMET Module S3AI managed by the Austrian Research Promotion Agency FFG.}

