% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{abbrvnat}
        % \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example


% \usepackage[colorlinks=true,linkcolor=black, citecolor=blue]{hyperref}
%\usepackage{hyperref}
% \usepackage{xr-hyper}
\usepackage{url}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{subfigure}

\usepackage{amssymb}




\usepackage{tikz}
% Tikz settings optimized for causal graphs.
% Just copy-paste this part
\usetikzlibrary{shapes, decorations,arrows,calc,arrows.meta,fit,positioning}
\tikzset{
    -Latex,auto,node distance =1. cm and 1. cm,semithick,
    state/.style ={ellipse, draw, minimum width = 0.4 cm},
    point/.style = {circle, draw, inner sep=0.04cm,fill,node contents={}},
    directed/.style={Latex-Latex,dashed},
    el/.style = {inner sep=2pt, align=left, sloped}
}

\def\z{{\phi(Z)}}
\def\E{{\mathcal {E}}}
\def\Ex{{\mathbb{E}}}
\def\H{{\mathcal{H}}}
\def\X{{\mathcal{X}}}
\def\Y{{\mathcal{Y}}}
\def\Z{{\mathcal{Z}}}
\def\L{L}
\def\A{{\mathcal{A}}}
\def\B{{\mathcal{B}}}
\def\N{{\mathcal{N}}}
\def\sumni{\sum_{i=1}^{n_1}}
\def\sumnii{\sum_{i=1}^{n_2}}
\def\sumnj{\sum_{j=1}^{n_2}}
\def\summ{\sum_{l=1}^m}
\def\a{{\text{a}}}
% \def\T{\text{T}}
\def\T{\top}
\def\op{O_p}
\def\di{{d_1}}
\def\dii{{d_2}}
\def\para{{||}}
\def\res{{\bot}}
\def\S{\mathbb S}


\newcommand{\R}{{{\mathbb R}}} 
%\newcommand{\E}{{{\mathbb E}}}
\newcommand{\PP}{{{\mathbb P}}} 
\newcommand{\cP}{{{\mathcal P}}} 

\newcommand{\F}{{{\mathcal F}}} 
\newcommand{\cH}{{{\mathcal H}}} 
\newcommand{\cE}{{{\mathcal E}}}

\usepackage{comment}
% \theoremstyle{definition}
\newtheorem{assumption}{Assumption}
% \newtheorem*{theorem*}{Theorem}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{remark}{Remark}
\newtheorem{example}{Example}
\newtheorem{definition}{Definition}
\newtheorem{corollary}{Corollary}
% \newtheorem*{corollary*}{Corollary}
\newtheorem{condition}{Condition}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\usepackage{cleveref}
\usepackage{todonotes}

\newcommand{\wx}[1]{\textcolor{black}{#1}}
\newcommand{\wq}[1]{\textcolor{orange}{#1}}
\newcommand{\td}[1]{\textcolor{magenta}{#1}}


% \usepackage{xr-hyper} 
% \externaldocument{shi_139-supp}


\title{Learning Nonlinear Causal Effects via Kernel Anchor Regression}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Wenqi Shi}
\author[2]{Wenkai Xu}
% \author[3]{Further~Coauthor}
% \author[1]{Further~Coauthor}
% \author[3]{Further~Coauthor}
% \author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Department of Industrial Engineering,
    
    Tsinghua University
    % Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Department of Statistics,
    
    University of Oxford
}
  
\begin{document}

\maketitle

\begin{abstract}
Learning causal {effects}
% relationships  
% between variables 
is a fundamental problem in science. Anchor regression has been developed to address this problem for a large class of causal graphical models, 
though the relationships between the variables are assumed to be linear. 
% allowing distribution shift from observational data.
In this work, we tackle the nonlinear setting by proposing kernel anchor regression (KAR).
% an nonparametric procedure. 
Beyond 
% the natural formulation using 
a classic two-stage least squares 
(2SLS) 
estimator, we also study an improved variant that involves nonparametric {kernel} regression in three separate stages.
% estimation procedure that can outperform the two-stage counterpart in certain scenarios. 
We provide 
% theoretical
convergence results
% the consistency and convergence rate 
for the
proposed 
KAR estimators
% We also 
and the identifiability conditions for
% the proposed 
KAR to learn the nonlinear structural equation models (SEM).
Experimental results demonstrate the superior performance of the proposed KAR estimators over existing baselines.
\end{abstract}

\section{Introduction}\label{sec:intro}
% Causal 
% % relationships
% {effects} are concerned with consequences of actions or decisions;
% thus, understanding these 
% % relationships #
Understanding causal
effects is a key ingredient in many scientific studies. For instance, medical practitioners need to know 
how effective 
% whether 
a treatment is 
% effective 
for the target disease in clinical trials; econometricians ask 
% whether
how much a particular purchasing behaviour changes 
% a change in 
the Consumer Price Index (CPI); epidemiologists want to understand 
to what extent a government 
% intervention
policy 
% has a positive effect on
can alleviate the pandemic. 
While the goal of revealing causal effects remains the same, the relevant notion of causality may differ across applications. 
To describe different aspects of these causal notions and to design corresponding statistical procedures for inferring causal effects, various frameworks have been developed, including 
%Rubin's 
the potential outcomes framework \citep{rubin2004direct,rubin2005causal}, 
counterfactual distributions \citep{chernozhukov2013inference} and 
% structured causal models (SCM) 
Pearl's causal graphical models \citep{pearl2000models,pearl2016causal}. A succinct yet comprehensive introduction can be found in \citet{peters2017elements}. 


Causality has also been an emerging field in the machine learning community, and
% Recently, 
machine learning techniques have been studied to improve the procedures for learning the causal effects. In particular, 
% nonparmetric 
independence \citep{gretton2005measuring} and conditional independence  \citep{fukumizu2007kernel} measures have been exploited to infer causal graphical models \citep{colombo2012learning, mooij2009regression}, especially for 
% the setting where the noise is 
the additive noise setting 
\citep{hoyer2008nonlinear, peters2014causal}.   Independent Component Analysis (ICA) 
% methods
\citep{hyvarinen2013independent, hyvarinen2017nonlinear} has been employed to identify causal relationships in both linear \citep{hyvarinen2010estimation, shimizu2006linear, shimizu2011directlingam} and non-linear settings \citep{monti2020causal, khemakhem2021causal}. Score matching \citep{hyvarinen2005estimation} has also been considered \citep{rolland2022score} for non-linear causal learning.
% discovery.
Moreover, kernel methods, which utilize the rich representations of reproducing kernel Hilbert spaces (RKHS), have been applied to tackle nonparametric estimation \citep{muandet2021counterfactual,singh2019kernel} and regression \citep{singh2019kernel, zhu2022causal} problems with causal implications.
Deep neural networks have also been used for learning treatment effects \citep{johansson2020generalization, kallus2020deepmatch, louizos2017causal} or useful causal representations \citep{besserve2019counterfactuals,scholkopf2021toward, xu2020learning,xu2021deep}.


% The approach for causal notion remains un-unified, while existing attempts has achieved interesting theoretical guarantees and interpretable results based on graphical models, i.e. the causal graph \citep{pearl2000causality}.


% A loose definition: $X$ causes $Y$ whenever a change in $X$ results in change in $Y$.
Recently, an elegant and statistically robust approach formulates causality through invariant risk minimization (IRM); see, for example, \citet{buhlmann2018invariance, peters2016causal}. The causal structure is thought to be invariant across environments and robust under intervention. The IRM learning procedure \citep{arjovsky2019invariant} on the observational data is then formulated as a regularized empirical risk minimization (ERM) to achieve both in-distribution
performance and out-of-distribution generalization.
In particular, anchor regression \citep{rothenhausler2018anchor},
{closely related to K-class estimators \citep{jakobsen2022distributional},}
has been developed under the IRM framework to tackle a very general class of causal graphical models with the confounders being partly 
(but not fully)
observed. By choosing different values of the regularization parameter, anchor regression is able to unify ordinary least squares (OLS) regression, partialling out (PA) regression, and instrumental variable (IV) regression. While existing works mostly considered linear cases \citep{oberst2021regularizing,rothenhausler2018anchor}, we explore the non-linear setting for anchor regression \citep{kook2022distributional}. Specifically, we consider nonparametric estimation to tackle non-linear features via RKHS functions.
{Although nonlinear anchor regression may not perform well in terms of generalization \citep{christiansen2021causal}, we show that the approach is valuable, as it can identify nonlinear causal effects under certain conditions when confounders are only partially observed and, in certain settings, can outperform other nonlinear methods in terms of MSE.}



This paper is structured as follows. In   \Cref{sec:background}, we review useful concepts including instrumental variable (IV), anchor regression (AR), and reproducing kernel Hilbert space (RKHS).
% multiple and
% structural equation model (SEM). 
Then we develop 
two versions of kernel anchor regression (KAR) estimators in \Cref{sec:kar}. Theoretical analysis on the estimators and the causal interpretation with nonlinear SEM are provided in \Cref{sec:analysis}. Experimental results for synthetic data and real-world applications are shown in \Cref{sec:simulation} followed by  concluding discussion and future directions in \Cref{sec:conclusion}. Code for the experiments is available at \url{https://github.com/Swq118/Kernel-Anchor-Regression}.
% attached in the supplementary material.
% available in the attachment.


\section{Background}\label{sec:background}


Directed acyclic graphs (DAGs) are a powerful class of graphical models for characterising conditional dependency structures and have been widely used for probabilistic modelling 
such as hidden Markov models
% (HMM) 
\citep{rabiner1986introduction}, latent variable models \citep{bishop1998latent} and topic models \citep{blei2012probabilistic}.
By enforcing certain Markov and faithfulness assumptions \citep{peters2011identifiability}, as well as noise structures \citep{hoyer2008nonlinear}, DAGs model causal relationships \citep{glymour2019review,spirtes2013causal} 
% (an example shown in \Cref{fig:anchor}) 
and corresponding learning procedures have been developed \citep{colombo2012learning,spirtes2000causation,zhang2018learning}.
% Moreover, high-dimensional DAG settings have also been studied \citep{colombo2012learning}.
% With specific assumption on how the noise 

\paragraph{From Instrumental Variable 
% Regression
to Anchor Regression}
\quad
Instrumental variable (IV) methods were developed to handle endogenous explanatory variables in econometrics \citep{bowden1990instrumental} and were later applied to estimating causal effects 
\citep{angrist1996identification}.
Consider the linear regression problem $Y = X \beta + \epsilon$. OLS assumes 
independence between noise $\epsilon$ and explanatory $X$ (the exogenous variable) and $\beta$ is estimated via minimizing
\begin{equation}\label{eq:ols_obj}
\beta^{OLS} = \argmin_{\beta}\mathbb E_{train}[\|Y-X\beta\|^2].
\end{equation}
% While the OLS assumes 
% independence between noise $\epsilon$ and explanatory $X$ 
% % $X\perp \epsilon$
% (the exogenous variable), 
The IV setting assumes explicit dependency between $X$  and $\epsilon$ via instrumental variable $Z$, i.e. $X = Z\theta + \varepsilon$ where $Z \perp\varepsilon$.
The two-stage least squares (2SLS) procedure, widely used in economics, tackles linear IV estimation by first regressing $X$ on $Z$ to get the conditional means $\bar{X}(z) := \mathbb E[X|Z=z]$ and secondly 
% linearly 
regressing outputs $Y$ on these conditional
means\footnote{ Writing
% $Y = X \beta + \epsilon  = 
% \underset{\mathbb E[{X}| Z]}{\underbrace{Z\theta }}\beta + (\varepsilon\beta + \epsilon)
% % {{(Z\theta)}}\beta + (\varepsilon\beta + \epsilon) = {\mathbb E[{X}| Z]}\beta + (\varepsilon\beta + \epsilon) 
% ,$
$Y = X \beta + \epsilon  = {{(Z\theta) }}\beta + (\varepsilon\beta + \epsilon)$  
% for the second stage, 
where $Z\theta = \Ex[X|Z]$,
the regressor is independent of the noise and OLS applies.}.
This corresponds to minimizing the projected least squares objective, 
\begin{equation}\label{eq:iv_obj}
\beta^{IV} = \argmin_{\beta}\mathbb E_{train}[\|P_Z(Y-X\beta)\|^2].
\end{equation}
Here $P_Z$ denotes the projection onto $Z$, where $P_{Z=z}(X) = \mathbb E [X|Z=z] = \bar X(z)$.
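As a hedged finite-sample illustration, suppose the observations are stacked into data matrices $\mathbf X$, $\mathbf Y$ and $\mathbf Z$, and that $\mathbf Z^\T\mathbf Z$ and $\mathbf X^\T\Pi_{\mathbf Z}\mathbf X$ are invertible; these symbols and assumptions are introduced here only for this example. Replacing $P_Z$ by its linear sample analogue $\Pi_{\mathbf Z} = \mathbf Z(\mathbf Z^\T\mathbf Z)^{-1}\mathbf Z^\T$ gives the familiar 2SLS closed form
\begin{align*}
\widehat\beta^{IV} &= \argmin_{\beta}\|\Pi_{\mathbf Z}(\mathbf Y-\mathbf X\beta)\|^2 \\
&= (\mathbf X^\T\Pi_{\mathbf Z}\mathbf X)^{-1}\mathbf X^\T\Pi_{\mathbf Z}\mathbf Y,
\end{align*}
which, since $\Pi_{\mathbf Z}$ is symmetric and idempotent, is the OLS regression of $\mathbf Y$ on the first-stage fitted values $\Pi_{\mathbf Z}\mathbf X$.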
2SLS works well when the underlying assumptions hold. 
% 
The corresponding DAG is shown in \Cref{fig:anchor} with only solid lines. 
In practice, the relation between
$Y$ and $X$ may not be linear, nor may the relation between $X$ and $Z$. {Non-parametric IV has been explored through moment estimations \citep{dikkala2020minimax}} as well as using deep neural networks \citep{bennett2019deep, centorrino2019nonparametric,hartford2017deep, singh2019kernel, xu2020learning,zhu2022causal}.



\begin{figure}[t!]
    \centering
\begin{tikzpicture}
    \node[state] (1) {$Z$};
    \node[state] (2) [right =of 1] {$X$};
    \node[state] (3) [right =of 2] {$Y$};
    \node[state] (4) [above right =of 2,xshift=-0.9cm,yshift=-0.3cm] {$C$};

    \path (1) edge node[above] {} (2);
    \path (2) edge node[above] {} (3);
   % \path[bidirected] (2) edge[bend left=60] node[above] {} (3);
    \path (4) edge node[el,above] {} (2);
    \path (4) edge node[el,above] {} (3);
    \path (1) edge[dashed, bend right=30] node[above] {} (3);
    \path (1) edge[dashed, bend left=10] node[above] {} (4);
\end{tikzpicture}
\caption{DAG representations for IV regression (solid lines only) and anchor regression  (with dashed lines).
}\label{fig:anchor}
\vspace{-0.15cm}
\end{figure}

However, $Y$'s dependency on $Z$ may not be solely through $X$: as the dashed line from $Z$ to $Y$ in \Cref{fig:anchor} indicates, $Y$ may depend on $Z$ directly, 
%even though 
and the strength of such dependency may remain unknown. The latent confounder $C$ may not be independent of 
%the measured confounder 
$Z$, as indicated by the dashed line from $Z$ to $C$ in \Cref{fig:anchor}. Incorporating such dependency structures covers a much more general class of DAGs, of which IV is a special case. To estimate $\beta$, anchor regression has been proposed \citep{rothenhausler2018anchor}, which effectively combines \Cref{eq:ols_obj} and \Cref{eq:iv_obj}. 
% Denote $\mathbb E_{train}$ by $\mathbb E$ when no ambiguity arise. 
For 
% chosen
regularization parameter $\gamma$ and identity operator $Id(Z):=Z$, 
\begin{align}
\beta^{\gamma} = \argmin_{\beta}\mathbb E_{train}[\|(Id-P_Z)(Y-X\beta)\|^2] \label{eq:ar_obj1}\\
% \mathbb E\|(Y-X\beta)\|^2 
+ \gamma \mathbb E_{train}[\|P_Z(Y-X\beta)\|^2].
\label{eq:ar_obj2}
\end{align}

Here, $\gamma\geq 0$ can be thought of as the level of direct dependency of $Y$ on $Z$\footnote{The smaller the value of $\gamma$, the stronger the dependency, i.e. the more solid the dashed line from $Z$ to $Y$.}.
By setting different $\gamma$ values, anchor regression 
recovers
%resembles 
classical settings, i.e. $\gamma=1$ corresponds to OLS, $\beta^1 = \beta^{OLS}$;
$\gamma\to \infty$ corresponds to IV, $\beta^{\to \infty}:= \lim_{\gamma \to \infty}\beta^{\gamma} = \beta^{IV}$;
$\gamma=0$ corresponds to the ``partialling out'' setting, where only the residuals from regressing $X$ and $Y$ on $Z$ are of interest. 
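
A short derivation makes these special cases transparent; the shorthand $r_\beta$ and $W_\gamma$ is introduced here purely for illustration. Write $r_\beta = Y - X\beta$. Since $P_Z$ is an $L_2$-orthogonal projection, the cross term between $(Id-P_Z)r_\beta$ and $P_Z r_\beta$ vanishes in expectation, so that
\begin{align*}
&\mathbb E_{train}[\|(Id-P_Z)r_\beta\|^2] + \gamma\, \mathbb E_{train}[\|P_Z r_\beta\|^2] \\
&\quad = \mathbb E_{train}[\|W_\gamma r_\beta\|^2], \qquad W_\gamma := Id - (1-\sqrt{\gamma})P_Z .
\end{align*}
Hence $\beta^\gamma$ can be read as the OLS solution after transforming both $Y$ and $X$ by $W_\gamma$: $W_1 = Id$ gives OLS, $W_0 = Id - P_Z$ gives partialling out, and as $\gamma\to\infty$ the projected term $P_Z r_\beta$ dominates, which recovers the IV objective in \Cref{eq:iv_obj}.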


% \subsection{Non-linear feature learning via RKHS}
\paragraph{Kernel-based 
% nonparametric 
Methods}

% Kernel methods employ 
Functions in RKHS have been employed to tackle various statistical and machine learning tasks with nonlinear features \citep{hofmann2008kernel}, e.g. kernel ridge regression, support vector machines,
% \citep{scholkopf2018learning,steinwart2008support}, 
etc.
RKHS functions have also been 
% developed
utilized to represent and characterize distributions, via kernel mean embedding \citep{muandet2017kernel}. For probability measure $p$ and kernel $k$ associated with RKHS $\H$, the mean embedding denoted by $\mu_p := \int k(x,\cdot) dp(x) \in \H$ has been widely used to compare distributions, e.g. via maximum-mean-discrepancy (MMD) \citep{gretton2012kernel}.
% With $p$ being a conditional distribution, 
Conditional mean embedding \citep{song2009hilbert} has also been considered for learning in regression problems \citep{fukumizu2007kernel,grunewalder2012conditional}.
% Various techniques have also been developed to formulate and learn operators relating to 
% % manipulate
% conditional mean embedding \citep{fukumizu2007kernel,grunewalder2012conditional}.
% 
With the rich representations,
% nonlinear features, 
RKHS functions are also applicable to learning on distributions directly via distribution regression \citep{szabo2015two, szabo2016learning}.
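
For concreteness, the squared MMD between probability measures $p$ and $q$ can be expanded using the reproducing property. This is a standard identity, stated here as an illustration with $x, x' \sim p$ and $y, y' \sim q$ drawn independently (notation introduced only for this example):
\begin{align*}
\mathrm{MMD}^2(p,q) &= \Vert \mu_p - \mu_q \Vert_{\H}^2 \\
&= \Ex[k(x,x')] - 2\,\Ex[k(x,y)] + \Ex[k(y,y')].
\end{align*}
Each expectation can be estimated by a sample average of kernel evaluations, so no explicit feature map is required.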
% whose analysis and techniques are closely related to what we are using to analyse the KAR estimators.
% \wx{include citations from \cite{dikkala2020minimax}, \cite{christiansen2021causal}, \cite{jakobsen2022distributional}.}

\section{Kernel Anchor Regression}\label{sec:kar}
To capture the non-linear features in the DAG, 
% in the causation,
we kernelize the anchor regression framework by
utilizing the rich feature representation of RKHS functions.
% introducing the reproducing kernel Hilbert spaces (RKHS) model. 
The kernelizing procedure is inspired by kernel instrumental variable (KIV) regression \citep{singh2019kernel}, where the operators are learned for conditional mean embedding in two separate regression stages. The DAG representation is illustrated in \Cref{fig:KAR}\footnote{We note that, as opposed to \Cref{fig:anchor}, there are no edges between $Z$, $X$ and $Y$, as the learning is not based on the original data space but on the feature spaces, with $\psi(X)\in\H_\X$ and $\phi(Z)\in\H_\Z$.}.
{In our setting, $Z$ denotes observable covariates, called anchors, which may or may not affect $X$ or $Y$. All unobservable latent confounders are denoted by $C$.}

Let $k_\X: \X \times \X \rightarrow \R$, $k_\Z: \Z \times \Z \rightarrow \R$ be measurable positive definite kernels corresponding to 
% scalar-valued 
RKHS $\H_\X$ and $\H_\Z$. Denote the feature maps
$
\psi: \X \rightarrow \H_\X, x \mapsto k_\X(x, \cdot)$ and 
$\phi: \Z \rightarrow \H_\Z, z \mapsto k_\Z(z, \cdot).
$
%As shown in Figure~\ref{fig:KAR}, our model is general, as we allow unobserved confounding factor $C$, and anchors $Z$ to influence treatment $X$, outcome $Y$ and confounding factor $C$. 
Let $P_{\z}$ and $Id$ denote the $L_2$-projection on the linear span from the components of $\phi(Z)$ and the identity operator, respectively. Denote $H: \H_\X \to \Y$ as the conditional operator we aim to learn.
Then for $\gamma \geq 0$, define the population-level kernel anchor regression operator $H^{\gamma}$ as
\begin{align}
    {H}^{\gamma}=\argmin_{H}  \Ex[\|(Id-P_{\z})(Y-H\psi(X))\|^2] \nonumber\\
     + \gamma \Ex[\|P_{\z}(Y-H\psi(X))\|^2].
     \label{eq:KAR}
\end{align}
% and its empirical analogue $\widehat{H}^{\gamma}$ 
% \begin{align}
%     \widehat{H}^{\gamma}=\argmin_{H} (  \Ex_{\widehat{p}}[((P_{\z}-Id)(Y-H\psi(X)))^2]\\  + \gamma \Ex_{\widehat{p}}[(P_{\z}(Y-H\psi(X))^2]  ).
% \end{align}
To unravel $P_\z$, both IV and AR estimators apply a two-stage procedure, where the first stage estimates the projection operator $P_\z$ and the second stage performs the projection-adjusted regression.



\begin{figure}[!tb]
  \centering
\centering
\begin{tikzpicture}
    \node[state] (1) {$Z$};
    \node[state] (2) [right =of 1]{$\mathcal{H}_\Z$};
    \node[state] (3) [below =of 1] {$X$};
    \node[state] (4) [right =of 3] {$\mathcal{H}_\X$};
    \node[state] (5) [right =of 4] {$C$}; %Hidden confounder
    \node[state] (6) [below =of 4] {$Y$};
    
    \path (1) edge node[above] {$\phi$} (2);
    \path (2) edge node[right] {} (4);
    \path (3) edge node[above] {$\psi$} (4);
    \path (2) edge[dashed] node[above] {} (5);
    \path (5) edge node[above, xshift=0.2cm] {} (4);
    %\path (1) edge node[above] {} (3);
    %\path (3) edge node[right, xshift=0.2cm] {$h$} (6);
    \path (4) edge node[right] {} (6);
    \path (5) edge node[above] {} (6);
    \path (2) edge[dashed, bend left=45] node[above] {} (6);
\end{tikzpicture}
\caption{
% \small 
DAG representation for kernel anchor regression. 
% Solid lines represent the dependencies that are also present in the instrumental variable model, while the dashed arrows represent the additional dependencies taken into account in the kernel anchor regression model.
}
\label{fig:KAR}
\vspace{-0.15cm}
\end{figure}



\subsection{Projection Stage}\label{sec:projection}
% \subsubsection{Stage I and Stage II}

The projection stage aims to tackle $P_\z$ by
% In the first two stages, we 
transforming the problem of learning 
$P_\z \psi(X)$ and $P_\z Y$ into two separate
% vector-valued 
kernel ridge regressions.
% , where the hypothesis spaces are the vector-valued RKHS $\H_\Gamma$ and $\H_\Theta$ of operators mapping $\H_\X$ and $\Y$ to $\H_\Z$, respectively. 
% We now state the loss function for optimizing operator $E$. The optimal 
Let operators $E_{X}:\H_\Z \to \H_\X$ and $E_{Y}:\H_\Z \to \Y$ be the projections to learn; 
$\alpha_1, \alpha_2 > 0$ be regularization parameters. {We note that due to the explicit dependency from $Z$ to $Y$, $P_\z Y$ needs to be treated separately from $P_\z \psi(X)$. This is different from the IV setting where $P_\z Y = Y$.
% as the edge from $Z$ to $Y$ in the DAG is absent.
}
The  
% regularized learning 
objectives regularized by the Hilbert-Schmidt (HS) norm are
% The projection $P_\z$ solves
\begin{align}\label{eq:proj_x}
    % E_{X,\alpha_1}^{\ast} =  \argmin 
    &\E_{\alpha_1}(E_X) = \Ex
    % _{(Z, X)} 
    \Vert \psi(X) - E_X \phi(Z) \Vert_{\H_\X}^2+ \alpha_1 \Vert E_X \Vert_{HS}^2,\\
    % E_{Y,\alpha_2}^{\ast} = \argmin 
    &\E_{\alpha_2}(E_Y) = \Ex \Vert Y - E_Y \phi(Z) \Vert_{{\Y}}^2+ \alpha_2 \Vert E_Y \Vert_{HS}^2.\label{eq:proj_y}
\end{align}
% The optimal operators are

Denote 
the optimal operators for the population risks as
$E^p_{\alpha_1,X} =  \argmin
_{E_X} 
\E_{\alpha_1}(E_X)$, and
$E^p_{\alpha_2,Y} = \argmin
_{E_Y} 
\E_{\alpha_2}(E_Y)$.
% We then present the estimations via empirical samples of two variants.
We then consider two variants of empirical risks and their corresponding estimations. 

% \paragraph{Empirical estimations}
\subsubsection{Disjoint sample sets projection}\label{sec:disjoint_proj}
Firstly, we treat two ridge regressions in \Cref{eq:proj_x} and  \Cref{eq:proj_y} independently, by using two \textit{disjoint}
sets of samples 
$\S_1 = \{(x_i, z_i)\}_{i \in [n_1]}$ and $\S_2 = \{(y_j, z_j)\}_{j \in [n_2]}$. The empirical forms for \Cref{eq:proj_x} and \Cref{eq:proj_y} are
\begin{align}\label{eq:proj_emp1}
% \tiny
% \E^{n_1}_{\alpha_1}(E_X) = 
\frac{1}{n_1}\sum_{i \in [n_1]}\Vert \psi(x_i) - E_X \phi(z_i) \Vert_{\H_\X}^2 
+ \alpha_1 \Vert E_X \Vert_{HS}^2, \\
\label{eq:proj_emp2}
% \E^{n_2}_{\alpha_2}(E_Y) =   
\frac{1}{n_2}\sum_{j \in [n_2]} \Vert y_j - E_Y \phi(z_j) \Vert_{\Y}^2 
+ \alpha_2 \Vert E_Y \Vert_{HS}^2&.
\end{align}

Denote 
% $\Phi_{1,Z} = (k_\Z(z_1,\cdot), \dots, k_\Z(z_{n_1},\cdot)), z_i \in \S_1$; $\Phi_{2,Z} = (k_\Z(z_1,\cdot), \dots, k_\Z(z_{n_2},\cdot)), z_j \in \S_{2}$; 
$\Phi_{1,Z} = (\phi(z_1), \dots, \phi(z_{n_1}))$, $\{z_i\}_{i\in[n_1]} \subset \S_1$; $\Phi_{2,Z} = (\phi(z_1), \dots, \phi(z_{n_2})), \{z_j\}_{j\in[n_2]} \subset \S_{2}$; 
their 
% corresponding 
Gram matrices
$K_{1,ZZ} = \Phi_{1,Z}^\T\Phi_{1,Z} \in \R^{n_1 \times n_1}$ and $K_{2,ZZ} = \Phi_{2,Z}^\T\Phi_{2,Z}\in \R^{n_2 \times n_2}$. % from  
% $\{z_i\}\subset \S_1 $ and $\{z_j\}\subset \S_2$ respectively; 
% $\Psi_{1,X}=(k_\X(x_1,\cdot), \dots, k_\X(x_{n_1},\cdot)), x_i \in \S_1$ and 
Denote $\Psi_{1,X}=(\psi(x_1), \dots, \psi(x_{n_1})), \{x_i\}_{i\in[n_1]} \subset \S_1$ and $Y_2=(y_1,\dots, y_{n_2}), \{y_j\}_{j\in[n_2]}\subset \S_2$. 
By the standard regression formula, the optimal operators to minimize \Cref{eq:proj_emp1} and \Cref{eq:proj_emp2} are
\begin{align}\label{eq:est_proj_x1}
    %\widehat 
    E_{\alpha_1, X}^{n_1} &= \Psi_{1,X}(K_{1,ZZ} + n_1\alpha_1 I)^{-1}\Phi_{1,Z}^\T, \\
    %\widehat 
    E_{\alpha_2, Y}^{n_2} &= Y_2 (K_{2,ZZ} + n_2 \alpha_2 I)^{-1}\Phi_{2,Z}^\T,
    \label{eq:est_proj_y1}
\end{align}

% {\wenqi I delete the hat here but not the corresponding terms below. Probably we should discuss whether we should delete all the hats.}\wk{agreed}
% where $K_{1,ZZ}$, $K_{2,ZZ}$ are the empirical kernel matrix for anchors $Z$ from stage I and stage II observations, respectively, $\Phi_{1,Z}$,  and $\Psi_{1,X}$ are feature matrices of $Z$ and $X$ from stage I observations, $\Phi_{2,Z}$ is the feature matrix of $Z$ from stage II observations, $y_2$ is the vectorization of $\{y_{2,i}\}$.
% Note that due to the disjoint \wk{set of}  i.i.d. samples of $Z$ are used, the gram matrix $K_{1,ZZ}$ and $K_{2,ZZ}$ are independent. 
% We can think of \Cref{eq:est_proj_x1} and \Cref{eq:est_proj_y1} have distinct projections onto feature spaces of $\phi(Z)$.
where the superscripts $n_1, n_2$ explicitly reveal sample sizes.
We note that the  projections $P_{\phi(Z)}$ are estimated differently
% are different 
for $P_{\phi(Z)}\psi(X)$ and $P_{\phi(Z)}Y$, through $(K_{1,ZZ} + n_1\alpha_1 I)^{-1}$ and $(K_{2,ZZ} + n_2 \alpha_2 I)^{-1}$, respectively.
$K_{1,ZZ}$ and $K_{2,ZZ}$ are independent due to the 
% use of 
disjoint i.i.d. sample sets of $Z$.
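
As an illustrative consequence of \Cref{eq:est_proj_x1} and \Cref{eq:est_proj_y1}, evaluating the fitted projections at a new anchor value $z$ only requires kernel evaluations. Write $\mathbf k_{1}(z) = \Phi_{1,Z}^\T\phi(z) = (k_\Z(z_1,z),\dots,k_\Z(z_{n_1},z))^\T$ with $z_i$ taken from $\S_1$, and analogously $\mathbf k_{2}(z) = \Phi_{2,Z}^\T\phi(z)$ with $z_j$ taken from $\S_2$; the symbols $\mathbf k_1$ and $\mathbf k_2$ are shorthand introduced here for illustration. Then
\begin{align*}
E_{\alpha_1, X}^{n_1}\phi(z) &= \Psi_{1,X}(K_{1,ZZ} + n_1\alpha_1 I)^{-1}\mathbf k_{1}(z), \\
E_{\alpha_2, Y}^{n_2}\phi(z) &= Y_2 (K_{2,ZZ} + n_2 \alpha_2 I)^{-1}\mathbf k_{2}(z).
\end{align*}
The first expression is a weighted combination of the features $\psi(x_i)$ and the second a weighted combination of the responses $y_j$, with weights determined by the respective regularized Gram matrices.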

\subsubsection{Joint sample set projection}\label{sec:joint_proj}
On the other hand, we can also consider the projection analogous to \citet{rothenhausler2018anchor}, where we jointly use the samples for both projections, i.e. projecting onto the same $\phi(Z)$ subspace. 
Setting $n = n_1 + n_2$ 
% for a fair comparison, 
and $\alpha = \alpha_1 = \alpha_2$, we consider the joint sample set $\S=\{(x_i, y_i, z_i)\}_{i \in [n]}$ and the empirical risks
\begin{align}\label{eq:proj}
    \frac{1}{n}\sum_{i \in [n]}\Vert \psi(x_i) - E_X \phi(z_i) \Vert_{\H_\X}^2+ \alpha \Vert E_X \Vert_{HS}^2, \\
    \frac{1}{n}\sum_{i \in [n]}\Vert y_i - E_Y \phi(z_i) \Vert_{\Y}^2+ \alpha \Vert E_Y \Vert_{HS}^2.
\end{align}

Denote $K_{ZZ}  \in \R^{n \times n}$ as the Gram matrix from  
$\{z_i\}_{i \in [n]}\subset \S $; 
$\Phi_{Z} = (\phi(z_1), \dots, \phi(z_n)), \{z_i\}_{i\in[n]} \subset \S$; $\Psi_{X}=(\psi(x_1), \dots, \psi(x_{n})), \{x_i\}_{i\in[n]} \subset \S$ and $Y=(y_1,\dots, y_{n}), y_i\in \S$. 
%$\Phi_{Z} = (k_\Z(z_1,\cdot), \dots, k_\Z(z_{n},\cdot)), z_i \in \S$; $\Psi_{X}=(k_\X(x_1,\cdot), \dots, k_\X(x_{n},\cdot)), x_i \in \S$ and $Y=(y_1,\dots, y_{n}), y_i\in \S$. 
Then we have
\begin{align}\label{eq:est_proj_x2}
    % \widehat 
    E_{\alpha, X}^{n} &= \Psi_{X}(K_{ZZ} + n\alpha I)^{-1}\Phi_{Z}^\T, \\
    % \widehat 
    E_{\alpha, Y}^{n} &= Y (K_{ZZ} + n \alpha I)^{-1}\Phi_{Z}^\T.
    \label{eq:est_proj_y2}
\end{align}

By setting the same level of regularization, the estimate of the projection $P_{\phi(Z)}$, through $(K_{ZZ} + n\alpha I)^{-1}$, is the same for $P_{\phi(Z)}\psi(X)$ and $P_{\phi(Z)}Y$.
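
To make this explicit, one may evaluate both estimators on the training anchors. Since $\Phi_{Z}^\T\Phi_{Z} = K_{ZZ}$, \Cref{eq:est_proj_x2} and \Cref{eq:est_proj_y2} give
\begin{align*}
E_{\alpha, X}^{n}\Phi_{Z} &= \Psi_{X} S_\alpha, \\
E_{\alpha, Y}^{n}\Phi_{Z} &= Y S_\alpha,
\end{align*}
where $S_\alpha := (K_{ZZ} + n\alpha I)^{-1}K_{ZZ} \in \R^{n \times n}$ is shorthand introduced here for illustration. Both fitted projections are thus obtained by applying the same smoothing matrix $S_\alpha$, in contrast to the disjoint-sample variant in \Cref{sec:disjoint_proj}.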


\subsection{Regression Stage}\label{sec:regression}
% \subsubsection{Stage III}
With the learned projections $P_{\phi(Z)}\psi(X)$ and $P_{\phi(Z)}Y$, we can now tackle the overall objective in \Cref{eq:KAR}.


Denote by $\E(E_X)$ and $\E(E_Y)$ the unregularized versions of \Cref{eq:proj_x} and \Cref{eq:proj_y}, and by $E^p_X$ and $E^p_Y$ their corresponding optimal operators, respectively. 
For given 
% parameter 
$\gamma$, define the transformed input and output as 
\begin{equation}\label{eq:transform_x}
 \psi_\gamma(X) = \psi(X) - E^p_{X} \phi(Z) + \sqrt{\gamma} E^p_{X} \phi(Z) \in \H_\X,    
 % \vspace{-1em}
\end{equation}
% and
\begin{equation}\label{eq:transform_y}
Y_\gamma = Y - E^p_{Y} \phi(Z) + \sqrt{\gamma} E^p_{Y} \phi(Z) \in \Y.
% \vspace{-2em}
\end{equation} 

\begin{proposition}[Equivalence]
Let $H:\H_\X \to \Y$, and 
%consider the risk on the transformed variable in \Cref{eq:transform_x} and \Cref{eq:transform_y}
consider the regression of transformed output in \Cref{eq:transform_y} on transformed input in \Cref{eq:transform_x}
\begin{equation}\label{eq:transform_obj}
    \E^\gamma(H) = \Ex_{(Z,X,Y)} \Vert Y_\gamma - H \psi_\gamma(X) \Vert_\Y^2.
\end{equation}
The solution to \Cref{eq:transform_obj} is equivalent to the KAR estimator in \Cref{eq:KAR}, i.e.
% \begin{equation}
$H^\gamma = \argmin_{H} \E^\gamma(H).$
% \end{equation}
\end{proposition}
\vspace{-0.5em}
The proof follows by expanding the projections 
% operator
$E^p_X$ and $E^p_Y$, and is similar to the linear case in \citet{rothenhausler2018anchor}.
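
A brief sketch of the key step is as follows, under two assumptions made here for illustration: $H$ is a bounded linear operator, so that $P_\z (H\psi(X)) = H E^p_{X}\phi(Z)$, and $P_\z Y$ is identified with $E^p_{Y}\phi(Z)$. Writing $r_H = Y - H\psi(X)$ (shorthand introduced here), \Cref{eq:transform_x} and \Cref{eq:transform_y} give
\begin{equation*}
Y_\gamma - H\psi_\gamma(X) = (Id-P_\z)\, r_H + \sqrt{\gamma}\, P_\z r_H .
\end{equation*}
Taking the squared norm and the expectation, the cross term between $(Id-P_\z) r_H$ and $P_\z r_H$ vanishes because $P_\z$ is an $L_2$-orthogonal projection, which leaves
\begin{equation*}
\E^\gamma(H) = \Ex[\|(Id-P_\z) r_H\|^2] + \gamma\, \Ex[\|P_\z r_H\|^2],
\end{equation*}
i.e. the objective in \Cref{eq:KAR}.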


% The problem of learning $H^\gamma$ can be transformed into a scalar-valued kernel ridge regression on transformed data set, where the hypothesis space is the vector-valued RKHS $\H_\Omega$ of operators mapping $\H_\X$ to $\Y$. Given parameter $\gamma$, define transformed input $X_\gamma = \psi(X) - E_{ X}^* \phi(Z) + \sqrt{\gamma} E_{ X}^* \phi(Z) $,
% and transformed outcome $Y_\gamma = Y - E_{ Y}^* \phi(Z) + \sqrt{\gamma} E_{ Y}^* \phi(Z)$. The unconstrained solution is
% \begin{eqnarray*}
%     \E^\gamma(H) &=& \Ex_{(Z,X,Y)} \Vert Y_\gamma - H X_\gamma \Vert_\Y^2,\\
%     H^\gamma &=& \argmin \E^\gamma(H).
% \end{eqnarray*}

With regularization parameter $\xi \geq 0$, \Cref{eq:transform_obj} has the kernel ridge regression form defined as
\begin{equation*}
% \label{eq:transform_reg}
        \E_\xi^\gamma(H) = \Ex_{(Z,X,Y)} \Vert Y_\gamma - H \psi_\gamma(X) \Vert_\Y^2 + \xi \Vert H \Vert_{HS}^2.
\end{equation*}

% \wk{I re-anotate the different notations here. please check the regularizer notation, which one you used in different stage of proof is more convenient.}
% Consider $E_{\alpha_1,X}^{\ast}$, $E_{\alpha_2, Y}^{\ast}$ the solutions for \Cref{eq:proj_x} and \Cref{eq:proj_y} repectively. 
\iffalse
The transformed input and output samples are
$$\psi_{\gamma,l}(x) = \psi(x_{l}) + (\sqrt{\gamma}-1)  E_{X}  \phi(z_{l}) \in \H_\X,$$ 
% and 
$$y_{\gamma,l} = y_{l} + (\sqrt{\gamma}-1) E_{Y}  \phi(z_{l}) \in \Y.$$ 
\fi
The regression stage is formulated independently of how the projections are estimated in \Cref{sec:projection}. For the empirical version, let
$\widehat E_X \in \{ E_{\alpha_1, X}^{n_1},  E_{\alpha, X}^{n}\}$
and $\widehat E_Y \in \{ E_{\alpha_2, Y}^{n_2}, E_{\alpha, Y}^{n}\}$
denote the estimated operators from the projection stage.
We use a sample set $\S^m=\{(x_l, y_l, z_l)\}_{l \in [m]}$, which is disjoint from the set $\S$ used in the previous stages,
and compute the transformed inputs and outputs as $$\widehat \psi_{\gamma,l}(x) = \psi(x_{l}) + (\sqrt{\gamma}-1) \widehat E_{X}  \phi(z_{l}) \in \H_\X,$$
$$\widehat y_{\gamma,l} = y_{l} + (\sqrt{\gamma}-1) \widehat E_{Y} \phi(z_{l}) \in \Y.$$
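As a sanity check on the two transformations above (an illustration read directly off their definitions, not an additional result), two values of $\gamma$ are worth writing out:
\begin{align*}
\gamma = 1:\quad & \widehat \psi_{\gamma,l}(x) = \psi(x_{l}), \quad \widehat y_{\gamma,l} = y_{l};\\
\gamma = 0:\quad & \widehat \psi_{\gamma,l}(x) = \psi(x_{l}) - \widehat E_{X}\phi(z_{l}), \\
& \widehat y_{\gamma,l} = y_{l} - \widehat E_{Y}\phi(z_{l}).
\end{align*}
That is, $\gamma = 1$ leaves the sample unchanged, so the regression stage reduces to kernel ridge regression on $\S^m$, whereas $\gamma = 0$ regresses the residuals that remain after subtracting the estimated projections onto the anchor features.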

% respectively; 
The empirical risk has the form
\begin{equation*}
% \label{eq:transform_reg}
        \widehat \E_\xi^{\gamma,m}(H) = \frac{1}{m}\sum_{l \in [m]}\Vert \widehat y_{\gamma,l} - H \widehat \psi_{\gamma,l}(x) \Vert_\Y^2 + \xi \Vert H \Vert_{HS}^2,
\end{equation*}

% \begin{eqnarray*}
%     \E_\xi^\gamma(H) &=& \E^\gamma(H) + \xi \Vert H \Vert_{\H_\Omega}^2,\\
%     \E_\xi^{\gamma,m}(H) &=& \frac{1}{m} \summ  \Vert  y_{\gamma,3,i} - H x_{\gamma,3,i} \Vert_{\Y}^2 + \xi \Vert H \Vert_{\H_\Omega}^2,\\
%     H_\xi &=& \argmin \E^\gamma_\xi(H), \\
%     H_\xi^m &=& \argmin \E_\xi^{\gamma,m}(H).
% \end{eqnarray*}
% However, we do not directly observe the conditional expectation operator $E_{X}$ and $E_{Y}$, so we approximate it using the estimates from first two stages. Let
% \begin{eqnarray*}
%     \hat x_{\gamma,3,i} &=&  \psi(x_{3,i}) + (\sqrt{\gamma}-1) (E_{\alpha_1, X}^{n_1})^*  \phi(z_{3,i}),\\
%     \hat y_{\gamma,3,i} &=& y_{\gamma,3,i} + (\sqrt{\gamma}-1) (E_{\alpha_2, Y}^{n_2})^* \phi(z_{3,i}),\\
%     \hat \E_\xi^{\gamma,m}(H) &=& \frac{1}{m} \summ  \Vert  {\hat y}_{\gamma,3,i} - H \hat x_{\gamma,3,i} \Vert_{\Y}^2 + \xi \Vert H \Vert_{\H_\Omega}^2,\\
%     \hat H_\xi^{\gamma,m} &=& \argmin \hat \E_\xi^{\gamma,m}(H).
% \end{eqnarray*}
$$
\widehat H_\xi^{\gamma,m} = \argmin \widehat \E_\xi^{\gamma,m}(H).
$$

Denote $\widehat \Psi_{\gamma} = (\widehat \psi_{\gamma,1}(x),\dots,\widehat \psi_{\gamma,m}(x))$, with
Gram matrix $K_{\widehat \Psi_\gamma  \widehat \Psi_\gamma}=\widehat \Psi_\gamma^{\T}\widehat \Psi_\gamma \in \R^{m\times m}$, and $\widehat Y_\gamma=(\widehat y_{\gamma,1},\dots,\widehat y_{\gamma,m})$. Again, by the standard kernel ridge regression formula,
\begin{equation}\label{eq:KAR_estimator}
    \widehat H_\xi^{\gamma,m} = \widehat{Y}_\gamma (K_{\widehat \Psi_\gamma  \widehat \Psi_\gamma} + m\xi I)^{-1} \widehat \Psi_\gamma^\T.
    \vspace{-.15em}
\end{equation}
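To make \Cref{eq:KAR_estimator} concrete, the following sketch evaluates the fitted operator at a new input. The symbols $x_{\mathrm{new}}$ and $k_\gamma(x_{\mathrm{new}}) = \widehat \Psi_\gamma^\T \psi(x_{\mathrm{new}}) \in \R^m$ are introduced here only for illustration, and we assume that $\psi$ is the canonical feature map of $k_\X$, so that $\langle \psi(x), \psi(x') \rangle_{\H_\X} = k_\X(x, x')$:
\begin{align*}
\widehat H_\xi^{\gamma,m} \psi(x_{\mathrm{new}}) &= \widehat{Y}_\gamma (K_{\widehat \Psi_\gamma  \widehat \Psi_\gamma} + m\xi I)^{-1} k_\gamma(x_{\mathrm{new}}),\\
[k_\gamma(x_{\mathrm{new}})]_l &= \langle \widehat \psi_{\gamma,l}(x), \psi(x_{\mathrm{new}}) \rangle_{\H_\X} \\
&= k_\X(x_l, x_{\mathrm{new}}) \\
&\quad + (\sqrt{\gamma}-1) \langle \widehat E_X \phi(z_l), \psi(x_{\mathrm{new}}) \rangle_{\H_\X}.
\end{align*}
Under these assumptions, prediction only requires solving an $m \times m$ linear system together with inner products in $\H_\X$.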
% where $\hat y_\gamma$ is the vectorization of $\{ \hat y_{\gamma,3,i}\}$, the $i$-th row of $\hat x_\gamma$ is $\hat x_{\gamma,3,i}$, and $\hat K_{ X_\gamma  X_\gamma} = \hat x_\gamma^\T \hat x_\gamma$.


\subsection{KAR Estimator}
Given observational data of size $N$, $\{(x_i, y_i, z_i)\}_{i \in [N]}$, the KAR procedure can be performed in two ways based on the two variants in the projection stage.
% \wk{mention randomised index for data splitting below}
\paragraph{Three-stage KAR}
To apply the disjoint sample sets projection in \Cref{sec:disjoint_proj}, we \emph{randomly} split the data set of size $N$ into three disjoint sets of sample size $n_1,n_2,m$ where $N=n_1 + n_2 + m$ 
%and index them accordingly 
and re-index them in $[N]$.
% from $\{1:N\}$. 
The first two sets of data $\{(x_i, z_i)\}_{i \in \{1 : n_1\}}$ and  $\{(y_j, z_j)\}_{j \in \{n_1+1 : n_1+n_2\}}$ are used for learning the projection operators in \Cref{eq:est_proj_x1} and \Cref{eq:est_proj_y1}.{ 
We note that 
%in this stage
samples $\{y_i\}_{i\in\{1:n_1\}}$ and $\{x_j\}_{j\in\{n_1+1:n_1+n_2\}}$ are not used.}
The third set $\{(x_l, y_l, z_l)\}_{l \in \{n_1+n_2+1 : N\}}$ is used in the regression stage to learn $ \widehat H_\xi^{\gamma,m}$ in \Cref{eq:KAR_estimator}. 
This procedure, termed \textbf{KAR}, involves solving three different regression problems,
% \footnote{However, our three-stage least square regression procedure is not be confused with the three-stage least square (3SLS) \citep{zellner1962three} under the framework of seemingly unrelated regressions (SUR) and solving simultaneous equation, where their extra stage from 2SLS is concatenating the simultaneous equations for variance reduction. Our third stage here is an additional kernel ridge regression from data splitting. Explicit connections of KAR with 3SLS are provided in the Appendix. }, 
% which is different
and differs from the two-stage settings used in linear anchor regression \citep{rothenhausler2018anchor}.
% Our three-stage least squares algorithm requires sample splitting to alleviate the finite sample bias \citep{angrist1995split}.

\paragraph{Two-stage KAR}
For the joint sample set projection in \Cref{sec:joint_proj}, we \emph{randomly} split the data of size $N$ into only two disjoint sets of sizes $n$ and $m$, where $N = n + m$, and re-index them as $\{(x_i, y_i, z_i)\}_{i \in \{1 : n\}}$ and $\{(x_l, y_l, z_l)\}_{l \in \{n+1 : N\}}$.
The first set is then grouped into $\{(x_i, z_i)\}_{i \in \{1 : n\}}$ and  $\{(y_i, z_i)\}_{i \in \{1:n\}}$ to learn the projection operators in \Cref{eq:est_proj_x1} and \Cref{eq:est_proj_y1}.{ 
In this manner, $\{z_i\}_{i\in\{1:n\}}$ are used twice.} 
The second set $\{(x_l, y_l, z_l)\}_{l \in \{n+1 : N\}}$ is used in the regression stage to learn $ \widehat H_\xi^{\gamma,m}$ in \Cref{eq:KAR_estimator}, exactly as in the three-stage procedure above.  
This procedure, termed \textbf{KAR.2}, mirrors the 2SLS structure used in KIV \citep{singh2019kernel} and linear anchor regression \citep{rothenhausler2018anchor}. 
Note that the KIV procedure of \citet{singh2019kernel} can be seen as a special case of our \textbf{KAR.2} by choosing $\gamma = \infty$.

% Consider a data set with $n_1 +n_2 +m$ observations of $(Z,X,Y)$, we denote $n_1$ stage I observations by $(z_{1,i}, x_{1,i}, y_{1,i})$, $n_2$ stage II observations by $(z_{2,i}, x_{2,i}, y_{2,i})$, and $m$ stage III observations by $(z_{3,i}, x_{3,i}, y_{3,i})$. Let $n = n_1 +n_2$ denote the number of observations from first two stages.
% The kernel anchor regression estimator can be computed through a three-stage algorithm. This algorithm is an extension to the classic two-stage least squares algorithm (2SLS) that are widely used in economics \citep{bollen1996alternative}.





\section{Analysis of KAR Estimators}\label{sec:analysis}

\subsection{Consistency}
% In this section, w
We first focus on the three-stage KAR procedure with disjoint sample sets for projection in \Cref{sec:disjoint_proj}.
The closed-form solutions and convergence rates of the estimators extend the analysis of 2SLS in KIV \citep{singh2019kernel}. 
We follow the integral operator notation of \citet{singh2019kernel}. Define the 
projection stage operators as
\begin{eqnarray*}
S_1^* &:& \H_\Z \rightarrow L^2(\Z,\rho_\Z), \quad
 l \rightarrow \langle l, \phi(\cdot)\rangle_{\H_\Z},\\
S_1 &:& L^2(\Z,\rho_\Z) \rightarrow \H_\Z, \quad \tilde l \rightarrow \int \phi(z)\tilde l(z) d\rho_\Z(z),
\end{eqnarray*}
where $\rho$ denotes the joint distribution of $(Z, X, Y)$. $L^2(\Z,\rho_\Z)$ denotes the space of square integrable functions from $\Z$ to $\Y$ with respect to measure $\rho_\Z$, where $\rho_\Z$ is the restriction of $\rho$ to $\Z$. $T_1 = T_2 = S_1 \circ S_1^*$ are then uncentered covariance operators.
%We place assumptions on the original spaces $\X$, $\Z$ and $\Y$, the scalar-valued RKHSs $\H_\X$, $\H_\Z$, the adjusted spaces $\X_a$, $\Y_a$ and the probability distribution $\rho (x,z)$.
% 
We define the power of operator $T_1$ with respect to its eigendecomposition. Let $\H_\Gamma = \H_\X\otimes\H_\Z$, $\H_\Theta = \Y \otimes\H_\Z$ and $\H_\Omega = \Y \otimes \H_\X$ be the relevant tensor product spaces for the operators.


\begin{condition}\label{cond::s1}
% Suppose that
% \begin{itemize}
%     \item[\textup{(i)}] $\X$ and $\Z$ are Polish spaces, i.e. separable and completely metrizable topological spaces.
%     \item[\textup{(ii)}] $k_\X$ and $k_\Z$ are continuous and bounded: 
    
%     $\sup_{x \in \X} \Vert \psi(x) \Vert_{\H_\X} \leq Q_1$, $\sup_{z \in \Z} \Vert \phi(z) \Vert_{\H_\Z} \leq \kappa$.
%     \item[\textup{(iii)}] $\psi$ and $\phi$ are measurable.
%      %\citep{sriperumbudur2010relation}
%     \item[\textup{(iv)}] $k_\X$ is characteristic.
%     \item[\textup{(v)}] $E_{X} \in \H_{\Gamma}$, then 
    
%     $\E_{1,X}(E_{X}) = \inf_{E \in \H_{\Gamma}} \E_{1,X}(E)$.
%     \item[\textup{(vi)}] Fix $\zeta_1 < \infty$. For given $c_1 \in (1,2]$, define the prior $\mathcal P (\zeta_1, c_1)$ as the set of probability distributions $\rho$ on $\X \times \Z$ s.t. a range space assumption is satisfied: $\exists G_1 \in \H_\Gamma$ s.t. 
%     $E_{X} = T_1^{\frac{c_1-1}{2}} \circ G_1$ and $\Vert G_1 \Vert_{\H_\Gamma}^2 \leq \zeta_1$.
% \end{itemize}
    
% \item[\textup{(i)}] 
    (i) $\X$ and $\Z$ are Polish, i.e. separable and completely metrizable topological spaces.
% \item[\textup{(ii)}] 
(ii) $k_\X$ and $k_\Z$ are continuous and bounded: 
    $\sup_{x \in \X} \Vert \psi(x) \Vert_{\H_\X} \leq Q_1$, $\sup_{z \in \Z} \Vert \phi(z) \Vert_{\H_\Z} \leq \kappa$.
    % \item[\textup{(iii)}] 
    (iii) $\psi$ and $\phi$ are measurable.
     %\citep{sriperumbudur2010relation}
    % \item[\textup{(iv)}] 
    (iv) $k_\X$ is characteristic.
    % \item[\textup{(v)}] 
    (v) $E_{X}^p \in \H_{\Gamma}$ s.t. 
    % then 
    $\E(E_{X}^p) = \inf_{E_X \in \H_{\Gamma}} \E(E_{X})$.
    % \item[\textup{(vi)}] 
    (vi) Fix $\zeta_1 < \infty$. For 
    % given 
    $c_1 \in (1,2]$, define the prior $\mathcal P (\zeta_1, c_1)$ as the set of probability distributions $\rho$ on $\X \times \Z$ s.t.
    % a range space assumption is satisfied: 
    $\exists G_1 \in \H_\Gamma$ s.t. 
    $E_{X}^p = T_1^{(c_1-1)/2} \circ G_1$ and $\Vert G_1 \Vert_{\H_\Gamma}^2 \leq \zeta_1$.
\end{condition}

Condition~\ref{cond::s1} 
is adapted from \citet{singh2019kernel} and is used to bound the approximation error of the regularized estimator $E_{\alpha_1,X}^{n_1}$. 
Parameter $c_1$ characterizes the smoothness of the conditional 
expectation 
operator $E_{X}^p$; a larger $c_1$ corresponds to a smoother operator.


\begin{lemma}\label{lem::s1}
$\forall \alpha_1 > 0$, the solution $E_{\alpha_1,X}^{n_1}$ of the regularized empirical objective 
%$\E_{\alpha_1, X}^{n_1}$ 
in \Cref{eq:proj_emp1} exists and is unique. With $
    \mathbf{T}_1 = \frac{1}{n_1} \sumni \phi(z_{i}) \otimes \phi(z_{i})$ and
    $\mathbf{g}_1 = \frac{1}{n_1} \sumni \phi(z_{i}) \otimes \psi(x_{i})$, the estimator in \Cref{eq:est_proj_x1} has the form
    $    E_{\alpha_1,X}^{n_1} = (\mathbf{T}_1 + \alpha_1)^{-1} \circ \mathbf{g}_1.$
% \begin{eqnarray*}
%     E_{\alpha_1,X}^{n_1} &=& (\mathbf{T}_1 + \alpha_1)^{-1} \circ \mathbf{g}_1.
% \end{eqnarray*}
% \wk{This equation about is for introducing alternative formulation of just connecting to $T_1$?}
% 
Under Condition~\ref{cond::s1} and $\alpha_1  = n_1^{-1/(c_1+1)}$, we have:
$  \Vert E_{\alpha_1,X}^{n_1} - E_{X}^p \Vert_{\H_\Gamma} = O_p(n_1^{-\frac{c_1-1}{2(c_1+1)}}).$
% \begin{eqnarray*}
%     \Vert E_{\alpha_1,X}^{n_1} - E_{X}^p \Vert_{\H_\Gamma} = O_p(n_1^{-\frac{c_1-1}{2(c_1+1)}}).
% \end{eqnarray*}
% when $\alpha_1  = n_1^{-1/(c_1+1)}$.
\end{lemma}
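As a worked instance of the rate in Lemma~\ref{lem::s1} (simple arithmetic on the stated exponents), at the largest admissible smoothness $c_1 = 2$ we obtain
\begin{equation*}
\alpha_1 = n_1^{-1/3}, \qquad \Vert E_{\alpha_1,X}^{n_1} - E_{X}^p \Vert_{\H_\Gamma} = O_p(n_1^{-1/6}),
\end{equation*}
while as $c_1 \downarrow 1$ the exponent $(c_1-1)/(2(c_1+1))$ tends to $0$ and the guaranteed rate degrades accordingly.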

% \begin{lemma}\label{lem::s1}
% $\forall \alpha_1 > 0$, the solution $E_{\alpha_1,X}^{n_1}$ of the regularized empirical objective $\E_{\alpha_1, X}^{n_1}$ exists, is unique, and
% \begin{eqnarray*}
%     E_{\alpha_1,X}^{n_1} &=& (\mathbf{T}_1 + \alpha_1)^{-1} \circ \mathbf{g}_1, \\
%     \mathbf{T}_1 &=& \frac{1}{n_1} \sumni \phi(z_{1,i}) \otimes \phi(z_{1,i}), \\
%     \mathbf{g}_1 &=& \frac{1}{n_1} \sumni \phi(z_{1,i}) \otimes \psi(x_{1,i}).
% \end{eqnarray*}
% Under Condition~\ref{cond::s1}, $\forall \delta \in (0,1)$, the following holds w.p. $1 - \delta$:
% \begin{eqnarray*}
%     \Vert E_{\alpha_1,X}^{n_1} - E_{X} \Vert_{\H_\Gamma} \leq
%     r_{E_1}(\delta,n_1,c_1) := \\
%     \frac{ \sqrt{\zeta_1} (c_1 + 1)}{4^{\frac{1}{c_1+1}}} \left( \frac{4\kappa (Q_1 + \kappa \Vert E_{X} \Vert_{\H_{\Gamma}} \ln(2/\delta) }{\sqrt{n_1 \zeta_1}(c_1 -1)} \right)^{\frac{c_1-1}{c_1+1}},\\
%     \alpha_1 = \left( \frac{8\kappa (Q_1 + \kappa \Vert E_{X} \Vert_{\H_\Gamma} \ln(2/\delta) }{\sqrt{n_1 \zeta_1}(c_1 -1)} \right)^{\frac{2}{c_1+1}}.
% \end{eqnarray*}
% \end{lemma}


Lemma~\ref{lem::s1} follows from \citet{singh2019kernel} and shows that the efficient rate of $\alpha_1$ is $n_1^{-1/(1+c_1)}$. Note that the convergence rate of $E_{\alpha_1,X}^{n_1}$ is calibrated by $c_1$, which measures the smoothness of the conditional expectation operator $E_{X}^p$.

For the disjoint set projection in \Cref{sec:disjoint_proj},
the closed-form solution and convergence rate of the estimator for $P_{\z}Y$ are analogous to those for $P_\z \psi(X)$, since the two are estimated independently; the analysis additionally requires the following condition.

\begin{condition}\label{cond::s2}
% Suppose that
% \begin{itemize}
    % \item[\textup{(i)}] $\Y$ is Polish space.
    % \item[\textup{(ii)}] $Y$ is bounded: $\sup_{y \in \Y} \Vert y \Vert_{\Y} \leq Q_2$.
    % \item[\textup{(iii)}] $E_{Y} \in \H_{\Theta}$, then 
    
    % $\E_{2,Y}(E_{ Y}) = \inf_{E \in \H_{\Theta}} \E_{2,Y}(E)$. 
    % \item[\textup{(iv)}] Fix $\zeta_2 < \infty$. For given $c_2 \in (1,2]$, define the prior $\mathcal P (\zeta_2, c_2)$ as the set of probability distributions $\rho$ on $\Y \times \Z$ s.t. a range space assumption is satisfied: $\exists G_2 \in \H_\Theta$ s.t. 
    % $E_{Y} = T_2^{\frac{c_2-1}{2}} \circ G_2$ and $\Vert G_2 \Vert_{\H_\Theta}^2 \leq \zeta_2$.
% \end{itemize}
    \textup{(i)} $\Y$ is a Polish space.
    \textup{(ii)} $Y$ is bounded: $\sup_{y \in \Y} \Vert y \Vert_{\Y} \leq Q_2$.
    \textup{(iii)} $E_{Y}^p \in \H_{\Theta}$ s.t. 
    $\E(E_{Y}^p) = \inf_{E_Y \in \H_{\Theta}} \E(E_Y)$.
    \textup{(iv)} Fix $\zeta_2 < \infty$. For 
    % given 
    $c_2 \in (1,2]$, define the prior $\mathcal P (\zeta_2, c_2)$ as the set of probability distributions $\rho$ on $\Y \times \Z$ s.t. 
    % a range space assumption is satisfied: 
    $\exists G_2 \in \H_\Theta$ s.t. 
    $E_{Y}^p = T_2^{(c_2-1)/{2}} \circ G_2$ and $\Vert G_2 \Vert_{\H_\Theta}^2 \leq \zeta_2$.
\end{condition}

\begin{lemma}\label{lem:s2}
$\forall \alpha_2 > 0$, the solution $E_{\alpha_2,Y}^{n_2}$ of the regularized empirical objective 
%$\E_{\alpha_2,Y}^{n_2}$ 
in \Cref{eq:proj_emp2} exists and is unique. With $\mathbf{T}_2 = \frac{1}{n_2} \sumnj \phi(z_{j}) \otimes \phi(z_{j})$ and
    $\mathbf{g}_2 = \frac{1}{n_2} \sumnj \phi(z_{j}) y_{j}$, the estimator in \Cref{eq:est_proj_y1} has the form
    $    E_{\alpha_2,Y}^{n_2} = (\mathbf{T}_2 + \alpha_2)^{-1} \circ \mathbf{g}_2.
    $
% \begin{eqnarray*}
%     E_{\alpha_2,Y}^{n_2} &=& (\mathbf{T}_2 + \alpha_2)^{-1} \circ \mathbf{g}_2.
%     % \mathbf{T}_2 &=& \frac{1}{n_2} \sumni \phi(z_{2,i}) \otimes \phi(z_{2,i}), \\
%     % \mathbf{g}_2 &=& \frac{1}{n_2} \sumni \phi(z_{2,i}) y_{2,i}.
% \end{eqnarray*}
Under Conditions~\ref{cond::s1}--\ref{cond::s2} and $\alpha_2  = n_2^{-1/(c_2+1)}$, we have:
$    \Vert E_{\alpha_2,Y}^{n_2} - E_{Y}^p \Vert_{\H_\Theta} = O_p(n_2^{-\frac{c_2-1}{2(c_2+1)}}).$
% \begin{eqnarray*}
%     \Vert E_{\alpha_2,Y}^{n_2} - E_{Y}^p \Vert_{\H_\Theta} = O_p(n_2^{-\frac{c_2-1}{2(c_2+1)}}).
% \end{eqnarray*}
% when $\alpha_2  = n_2^{-1/(c_2+1)}$.
\end{lemma}


% \begin{lemma}\label{lem:s2}
% $\forall \alpha_2 > 0$, the solution $E_{\alpha_2,Y}^{n_2}$ of the regularized empirical objective $\E_{\alpha_2}^{n_2}$ exists, is unique, and
% \begin{eqnarray*}
%     E_{\alpha_2,Y}^{n_2} &=& (\mathbf{T}_2 + \alpha_2)^{-1} \circ \mathbf{g}_2, \\
%     \mathbf{T}_2 &=& \frac{1}{n_2} \sumni \phi(z_{2,i}) \otimes \phi(z_{2,i}), \\
%     \mathbf{g}_2 &=& \frac{1}{n_2} \sumni \phi(z_{2,i}) y_{2,i}.
% \end{eqnarray*}
% Under Condition~\ref{cond::s1} and Condition~\ref{cond::s2}, $\forall \epsilon \in (0,1)$, the following holds w.p. $1 - \epsilon$:
% \begin{eqnarray*}
%     \Vert E_{\alpha_2,Y}^{n_2} - E_{Y} \Vert_{\H_\Theta} \leq
%     r_{E_2}(\epsilon,n_2,c_2) := \\
%     \frac{ \sqrt{\zeta_2} (c_2 + 1)}{4^{\frac{1}{c_2+1}}} \left( \frac{4\kappa (Q_2 + \kappa \Vert E_{Y} \Vert_{\H_{\Theta}} \ln(2/\epsilon) }{\sqrt{n_2 \zeta_2}(c_2 -1)} \right)^{\frac{c_2-1}{c_2+1}},\\
%     \alpha_2 = \left( \frac{8\kappa (Q_2 + \kappa \Vert E_{Y} \Vert_{\H_\Theta} \ln(2/\epsilon) }{\sqrt{n_2 \zeta_2}(c_2 -1)} \right)^{\frac{2}{c_2+1}}.
% \end{eqnarray*}
% \end{lemma}

As in learning the projection $P_\z \psi(X)$, the efficient rate of $\alpha_2$ is $n_2^{-1/(1+c_2)}$, where $c_2$ measures the smoothness of the conditional expectation operator $E_{Y}^p$.


Let $L^2(\H_\X,\rho_{\H_\X})$ denote the space of square integrable functions from $\H_\X$ to $\Y$ with respect to measure $\rho_{\H_\X}$, where $\rho_{\H_\X}$ is the extension of $\rho$ to $\H_\X$. Define the regression stage operator as %\td{$X_{\gamma}$ is changed to $\psi_{\gamma}$ or  $\psi_{\gamma}(X)$; need to unify here and Thm 1. }
%{\wenqi $X_{\gamma}$ and $X_a$ are changed to $\psi_{\gamma}$.}
\begin{eqnarray*}
S^* &:& \H_\Omega \rightarrow L^2(\H_\X,\rho_{\H_\X}), \quad 
H \rightarrow \Omega^*_{(\cdot)} H,\\
S &:& L^2(\H_\X,\rho_{\H_\X}) \rightarrow \H_\Omega, \\ 
&&\tilde H \rightarrow \int \Omega_{\psi_\gamma} \circ \tilde H \psi_\gamma d \rho_{\H_\X}(\psi_\gamma),
\end{eqnarray*}
where $\Omega_{\psi_\gamma}: \Y \rightarrow \H_\Omega$, defined by $y \rightarrow \Omega (\cdot, \psi_\gamma)y$, is the point evaluator of \citet{micchelli2005learning}. 
Define $T_{\psi_\gamma} = \Omega_{\psi_\gamma} \circ \Omega^*_{\psi_\gamma}$ and the covariance operator $T=S \circ S^*$. We define the power of operator $T$ with respect to its eigendecomposition. Condition~\ref{cond::s3} below extends Hypotheses 7--9 of \citet{singh2019kernel} and is sufficient to bound the excess error of 
$\widehat H_{\xi}^{\gamma, m}$ in terms of 
the error propagated from the estimators in the projection stage. 
% I and stage II estimators.

\begin{condition}\label{cond::s3}
% Suppose that
\begin{itemize}
    \item[\textup{(i)}] The $\{ \Omega_{\psi_\gamma}\}$ operator family is uniformly bounded in Hilbert-Schmidt norm: $\exists B$ s.t. $\forall \psi_\gamma$, 
    %$\Vert \Omega_{\psi_\gamma} \Vert_{HS}^2 = Tr( \Omega^*_{\psi_\gamma} \circ \Omega_{\psi_\gamma}) \leq B$.
    $\Vert \Omega_{\psi_\gamma} \Vert_{\L_2(\Y,\H_\Omega)}^2 = Tr( \Omega^*_{\psi_\gamma} \circ \Omega_{\psi_\gamma}) \leq B$. 
    \item[\textup{(ii)}] The $\{ \Omega_{\psi_\gamma}\}$ operator family is Hölder continuous in operator norm: $\exists L > 0, \iota \in (0,1]$ s.t. $\forall \psi_\gamma, \psi_\gamma^\prime, \Vert \Omega_{\psi_\gamma} - \Omega_{\psi_\gamma^\prime} \Vert_{\L(\Y,\H_\Omega)} \leq L \Vert \psi_\gamma - \psi_\gamma^\prime \Vert_{\H_\X}^\iota$.
    \item[\textup{(iii)}] $H^\gamma \in \H_\Omega$ s.t. $\E^\gamma(H^\gamma) = \inf_{H \in \H_\Omega} \E^\gamma(H)$.
    \item[\textup{(iv)}] $Y_\gamma$ is bounded, i.e. $\exists C < \infty$ s.t. $\Vert Y_\gamma \Vert_\Y \leq C$.
    % almost surely.
    \item[\textup{(v)}] Fix $\zeta < \infty$. For given $b_\gamma \in (1,\infty]$ and $c_\gamma \in (1,2]$, define the prior $\mathcal P (\zeta, b_\gamma, c_\gamma)$ as the set of probability distributions $\rho$ on $\H_\X \times \Y$ s.t. 
    \begin{itemize}
       \item [\textup{(a)}] 
        a range space assumption is satisfied: 
        $\exists G \in \H_\Omega$ s.t. $H^\gamma = T^{\frac{(c_\gamma-1)}{2}} \circ G$ and $\Vert G \Vert_{\H_\Omega}^2 \leq \zeta$;
        \item [\textup{(b)}] the eigenvalues in the spectral decomposition $T = \sum_{k=1}^\infty \lambda_k e_k\langle\cdot, e_k\rangle_{\H_\Omega} $, where $\{e_k\}_{k=1}^\infty$ is a basis of $Ker(T)^{\bot}$, 
        satisfy $\alpha \leq k^{b_\gamma}\lambda_k \leq \beta$ for 
        some
        $\alpha, \beta >0$.
        %\wk{what is $\gamma_k$ here? or it's $\lambda_k$. }
    \end{itemize}
\end{itemize}
\end{condition}
We note that 
all parameters in Condition~\ref{cond::s3} 
depend on $\gamma$, although this dependence is not available in explicit form. We attach the subscript $\gamma$ to the parameters $b_\gamma$ and $c_\gamma$ to emphasize this dependence. Parameter $b_\gamma$ measures the decay of the eigenvalues of the covariance operator $T$; a larger $b_\gamma$ suggests a smaller effective input dimension. A larger $c_\gamma$ corresponds to a smoother operator $H^\gamma$.

% The estimator in stage III has a closed form as follows.
\begin{lemma}\label{lem:s3}
$\forall \xi > 0$, 
%the solution $E_{\xi}^{\gamma,m}$ to $\E_\xi^{\gamma,m}$ and 
the solution $\widehat H_\xi^{\gamma,m}$ to $\widehat \E_\xi^{\gamma,m}$ exists and is unique for each $\gamma$. 
%{\wenqi The current definition of $\E_\xi^{\gamma,m}$ is actually $\widehat \E_\xi^{\gamma,m}$; missing definition for $\E_\xi^{\gamma,m}$.} 
%\wk{yes, amended above. so i m thinking to define $\E_\xi^{\gamma,m}$ in the appendix for the proof only. or is there a particular place in main text we need $\E_\xi^{\gamma,m}$?}
%{\wenqi I see. I delete $\E_\xi^{\gamma,m}$ and the corresponding terms in the main text.}
% \begin{eqnarray*}
%     %&\mathbf{T}= \frac{1}{m} \summ T_{\psi_{\gamma,l}}, \quad
%     %\mathbf{g} = \frac{1}{m} \summ \Omega_{\psi_{\gamma,l}} y_{\gamma,l},\\
%     %&H_\xi^{\gamma,m} = (\mathbf{T} + \xi)^{-1} \circ \mathbf{g}, \\
%     &\widehat{\mathbf{T}}= \frac{1}{m} \summ T_{\widehat \psi_{\gamma,l}}, \quad
%     \widehat{\mathbf{g}} = \frac{1}{m} \summ \Omega_{\widehat \psi_{\gamma,l}} \widehat y_{\gamma,l},\\
%     &\widehat H_\xi^{\gamma,m} = ( \hat{\mathbf{T}} + \xi)^{-1} \circ \hat{\mathbf{g}}.
% \end{eqnarray*}
Let $\widehat{\mathbf{T}}= \frac{1}{m} \summ T_{\widehat \psi_{\gamma,l}}$ and $\widehat{\mathbf{g}} = \frac{1}{m} \summ \Omega_{\widehat \psi_{\gamma,l}} \widehat y_{\gamma,l}$. The estimator in \Cref{eq:KAR_estimator} then has the form
$$\widehat H_\xi^{\gamma,m} = ( \widehat{\mathbf{T}} + \xi)^{-1} \circ \widehat{\mathbf{g}}.$$
\end{lemma}


\begin{condition}
\label{cond::s1&2}
For $c_1, c_2$ 
set in Conditions~\ref{cond::s1}--\ref{cond::s2} and $\iota$ satisfying Condition~\ref{cond::s3}, assume
$n_2 \geq n_1^{\frac{\iota(c_1-1)(c_2+1)}{(c_1+1)(c_2-1)}}$.
\end{condition}

\begin{remark}
Condition~\ref{cond::s1&2} is sufficient but not necessary to ensure that the error propagated to the regression stage from estimating $E_Y^p$ is smaller than that from estimating $E_X^p$ in the disjoint sample sets projection.
% the error estimating $E_Y$ is smaller than estimating $E_X$ and propagate to regression stage.
% {\wenqi I change the remark, but it's still a little bit weird.}
\end{remark}
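One way to read Condition~\ref{cond::s1&2}, offered as an interpretation rather than as part of the formal statement, is to raise both sides to the power $(c_2-1)/(2(c_2+1)) > 0$ and invert, which gives the equivalent form
\begin{equation*}
n_2^{-\frac{c_2-1}{2(c_2+1)}} \leq \Big( n_1^{-\frac{c_1-1}{2(c_1+1)}} \Big)^{\iota}.
\end{equation*}
The left-hand side is the rate for $E_{\alpha_2,Y}^{n_2}$ in Lemma~\ref{lem:s2}, and the right-hand side is the rate for $E_{\alpha_1,X}^{n_1}$ in Lemma~\ref{lem::s1} raised to the H\"older exponent $\iota$ of Condition~\ref{cond::s3}.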

The main challenge in extending the convergence rate of the
KIV estimator \citep{singh2019kernel} to the
KAR estimator is that, in our case, the excess error depends not only on the accuracy of the $E_X^p$ estimator but also on the accuracy of the $E_Y^p$ estimator.
However, by imposing Condition~\ref{cond::s1&2}, we ensure that the error 
from estimating $E_Y^p$ is dominated by that from estimating $E_X^p$, and we establish the optimal convergence rate for KAR
in Theorem~\ref{thm::s3c}.
In this way, the three-stage procedure guarantees the same convergence rate as the two-stage procedure in KIV.

%We require that the error propagated from estimating $E_Y$ to be dominated by that from estimating $E_X$ in the projection stage, because the estimation error of $E_X$ influences the final excess error in a more complex way. Intuitively, $\Vert \Omega_{\widehat \psi_\gamma} - \Omega_{\psi_\gamma} \Vert$ contributes to both $\Vert \hat{\mathbf{g}} -  \mathbf{g} \Vert$ and $\Vert \hat{\mathbf{T}} -  \mathbf{T} \Vert$, while $\Vert \widehat y_{\gamma} - y_{\gamma} \Vert$ only contributes to $\Vert \hat{\mathbf{g}} -  \mathbf{g} \Vert$. The exact components of excess error are given in supplementary material.

%The main challenge of extending the convergence rate of kernel instrumental variable estimator to kernel anchor regression estimator is that the error $\Vert \hat{\mathbf{g}} -  \mathbf{g} \Vert$ is now a compound of $\Vert \Omega_{\widehat \psi_\gamma} - \Omega_{\psi_\gamma} \Vert$ caused by estimating $E_X$ and $\Vert \widehat y_{\gamma} - y_{\gamma} \Vert$ caused by estimating $E_Y$, while the excess error of kernel instrumental variable estimator in \citep{singh2019kernel} only contains the first term. However, we find that by assuming Condition~\ref{cond::s1&2}, the error caused by estimating $E_X$ dominates the term $\Vert \hat{\mathbf{g}} -  \mathbf{g} \Vert$. As Thereom~\ref{thm::s3c} below shows, the convergence rate of $\hat H_\xi^{\gamma,m}$ is then identical to the convergence rate of Kernel IV estimator from \citep{singh2019kernel}.

\begin{theorem}\label{thm::s3c}
Under Conditions~\ref{cond::s1}--\ref{cond::s1&2}, let $d_1, d_2 > 0$
and 
choose
$\alpha_1 = n_1^{-\frac{1}{c_1+1}}$, $\alpha_2 = n_2^{-\frac{1}{c_2+1}}$,
    $n_1 = m^{\frac{d_1(c_1+1)}{\iota(c_1-1)}}$, 
    $n_2 = m^{\frac{d_2(c_2+1)}{\iota(c_2-1)}}$,
% \begin{eqnarray*}
%     \alpha_1 = n_1^{-\frac{1}{c_1+1}}, \quad
%     \alpha_2 = n_2^{-\frac{1}{c_2+1}},\\
%     n_1 = m^{\frac{d_1(c_1+1)}{\iota(c_1-1)}}, \quad
%     n_2 = m^{\frac{d_2(c_2+1)}{\iota(c_2-1)}},
% \end{eqnarray*}
% $\alpha_1 = n_1^{-\frac{1}{c_1+1}}$, $\alpha_2 = n_2^{-\frac{1}{c_2+1}}$, $n_1 = m^{\frac{d_1(c_1+1)}{\iota(c_1-1)}}$ and $n_2 = m^{\frac{d_2(c_2+1)}{\iota(c_2-1)}}$, 
% where $d_1, d_2 > 0$. 
we have:
\begin{itemize}
    \item [\textup{(i)}] If $d_1 \leq \frac{b_\gamma(c_\gamma+1)}{b_\gamma c_\gamma+1}$, then $\E^\gamma(\widehat H_\xi^{\gamma,m}) - \E^\gamma( H^\gamma) = \op(m^{-\frac{d_1 c_\gamma}{c_\gamma+1}})$ with $\xi = m^{-\frac{d_1}{c_\gamma+1}}$.
    \item [\textup{(ii)}] If $d_1 > \frac{b_\gamma(c_\gamma+1)}{b_\gamma c_\gamma+1}$, then $\E^\gamma(\widehat H_\xi^{\gamma,m}) - \E^\gamma( H^\gamma) = \op(m^{-\frac{b_\gamma c_\gamma}{b_\gamma c_\gamma+1}})$ with $\xi = m^{-\frac{b_\gamma}{b_\gamma c_\gamma+1}}$.
\end{itemize}
\end{theorem}
At $d_1 = b_\gamma(c_\gamma+1)/(b_\gamma c_\gamma+1) < 2$, the convergence rate $m^{-b_\gamma c_\gamma/(b_\gamma c_\gamma+1)}$ of the KAR estimator is optimal. 
This statistically efficient rate is calibrated by $b_\gamma$, the effective input dimension, 
together with $c_\gamma$, the smoothness of 
the operator $H^\gamma$. The condition $d_1 = b_\gamma(c_\gamma+1)/(b_\gamma c_\gamma+1) < 2$ also suggests that $n_1 > m$.
{We provide additional discussion of the two-stage KAR estimator and show that only a slower convergence rate can be guaranteed (see Section A.3 in the supplementary material).}
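As a consistency check on Theorem~\ref{thm::s3c} (a worked calculation, with the numerical values chosen purely for illustration), substituting the boundary value $d_1 = b_\gamma(c_\gamma+1)/(b_\gamma c_\gamma+1)$ into case (i) gives
\begin{equation*}
\frac{d_1 c_\gamma}{c_\gamma+1} = \frac{b_\gamma c_\gamma}{b_\gamma c_\gamma+1}, \qquad \frac{d_1}{c_\gamma+1} = \frac{b_\gamma}{b_\gamma c_\gamma+1},
\end{equation*}
so the rate and the choice of $\xi$ in cases (i) and (ii) agree at the boundary. For example, $b_\gamma = 2$ and $c_\gamma = 2$ give the boundary value $d_1 = 6/5$, $\xi = m^{-2/5}$, and an excess error of order $m^{-4/5}$. Moreover, with the sample sizes chosen as in the theorem and $m > 1$, Condition~\ref{cond::s1&2} reduces to $d_2 \geq \iota d_1$.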

\iffalse
\begin{corollary}[two-stage]
Let ... then $\|H_{\xi}^{\gamma, m} - H \|_{\H_{\Omega}}\to 0$ as $n,m\to \infty$.
\end{corollary}
\td{Then comment on 2-stage 3-stage difference}
\fi
% Additional results and discussions including the two-stage
% approach are included in the Appendix.


% Despite the three-stage algorithm we propose, it's also possible to compute kernel anchor regression estimator through a two-stage algorithm by combining our stage I and stage II and learning $P_\z \psi(X)$ and $P_\z Y$ together by one vector-valued kernel ridge regression. To be specific, with given regularization parameter $\alpha$, let 
% \begin{eqnarray*}
%     (E_{\alpha, X}^{n})^* &=& \Psi_{X}(K_{ZZ} + n \alpha I)^{-1}\Phi_{Z}^\T, \\
%     (E_{\alpha, Y}^{n})^* &=& y^\T (K_{ZZ} + n \alpha I)^{-1}\Phi_{Z}^\T,
% \end{eqnarray*}
% denote the estimates of $E_{\rho, X}$ and $E_{\rho, Y}$, respectively, where $K_{ZZ}$, $\Phi_{Z}$, $y$ are defined similarly to $K_{1,ZZ}$, $\Phi_{1,Z}$, $y_2$, by replacing the stage I observations or stage II observations with a combination of stage I and stage II observations together ($n$ units in total). A new estimator $\tilde H_\xi^m$ can then be defined similarly to $\hat H_\xi^m$ by replacing $E_{\alpha_1, X}^{n_1}$ with $E_{\alpha, X}^{n}$, and $E_{\alpha_2, Y}^{n_2}$ with $E_{\alpha, Y}^{n}$. 

% ?Estimator $\tilde H_\xi^m$ can be computed from a two-stage algorithm, which seems to be more concise than the three-stage algorithm required for $\hat H_\xi^m$. However, the excess error bound for $\tilde H_\xi^m$ is no smaller than $\hat H_\xi^m$. Therefore, we recommend the three-stage algorithm and $\hat H_\xi^m$ in terms of efficiency.

%{\wenqi I'm not sure about this. Generally speaking, if $c_1 \neq c_2$, then for $\forall \alpha \geq 0$, the upper bounds that can be derived for $\Vert E_{\alpha, X}^{n} - E_{\rho, X} \Vert$ and $\Vert E_{\alpha, Y}^{n} - E_{\rho, Y} \Vert$ are larger than the upper bounds for $\Vert E_{\alpha_1, X}^{n_1} - E_{\rho, X} \Vert$ and $\Vert E_{\alpha_2, Y}^{n_2} - E_{\rho, Y} \Vert$, respectively. These upper bounds will be propagated to the upper bounds for the final excess error. However, I can only compare the upper bounds, and this doesn't seem to be enough for comparison between convergence rate. For example, if $a_1 = O(f_1)$, $a_2 = O(f_2)$, then $f_1 = O(f_2)$ cannot ensure $a_1 = O(a_2)$.} 

\subsection{Causal effect and target KAR estimate}
In this section, we consider the setting where the data are generated from a 
structural causal model with nonlinear features, 
as shown below,
\begin{align}\label{eq:nonlinear_sem}
    % \left(
    \begin{pmatrix}
        C  \\
        \psi(X) \\
        Y \end{pmatrix}
        % \right)
        = B
%   \left(
   \begin{pmatrix}
        \phi(Z) \\
        C  \\
        \psi(X) \\
        Y \end{pmatrix}
        % \right)
        + 
        % \left(
        \begin{pmatrix}
        \epsilon_C  \\
        \epsilon_X \\
        \epsilon_Y\end{pmatrix},
        % \right), 
\end{align}
where we write operator $B$ in the following matrix form
% $\phi(Z) = \epsilon_z$, and
% $$
% B = \left( \begin{array}{cccc}
%         B_{ZC} & 0 & 0 & 0 \\
%         B_{ZX} & B_{CX} & 0 & 0 \\
%         B_{ZY} & B_{CY} & B_{XY} & 0
%   \end{array} \right)
% $$
$$
B = \begin{pmatrix}
        B_{CZ} & 0 & 0 & 0 \\
        B_{XZ} & B_{XC} & 0 & 0 \\
        B_{YZ} & B_{YC} & B_{YX} & 0
   \end{pmatrix}.
$$
We note that each operator $B_{\triangle \square}$ maps the $\square$-related space to the $\triangle$-related space, e.g. $B_{XZ}:\H_\Z \to \H_\X$ and $B_{YZ}:\H_\Z \to \Y$.
% 
% is an unknown constant matrix, the anchors $Z$ and the hidden variables $C$ are random vectors, 
The 
noise variables $\epsilon_Z$, $\epsilon_C$, $\epsilon_X$ and $\epsilon_Y$ 
are independent of each other. Let $\Sigma_Z$, $\Sigma_C$, $\Sigma_X$ and $\Sigma_Y$ denote the covariance of $\epsilon_Z$, $\epsilon_C$, $\epsilon_X$ and $\epsilon_Y$, respectively. Here each 
operator in $B$ represents an edge in the model shown in Figure~\ref{fig:KAR}. For instance, $B_{CZ}$ stands for the edge from $\H_\Z$ to $C$, and $B_{YX}$ corresponds to the edge from $\H_\X$ to $Y$. The operator $B_{YX}$ in \Cref{eq:nonlinear_sem} reflects the causal effect we are interested in.
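For concreteness, reading \Cref{eq:nonlinear_sem} row by row with the block form of $B$ (a direct expansion, not an additional assumption) gives
\begin{align*}
C &= B_{CZ}\phi(Z) + \epsilon_C,\\
\psi(X) &= B_{XZ}\phi(Z) + B_{XC} C + \epsilon_X,\\
Y &= B_{YZ}\phi(Z) + B_{YC} C + B_{YX}\psi(X) + \epsilon_Y.
\end{align*}
Substituting the first row into the third yields
\begin{align*}
Y &= (B_{YZ} + B_{YC}B_{CZ})\phi(Z) + B_{YX}\psi(X) \\
&\quad + B_{YC}\epsilon_C + \epsilon_Y,
\end{align*}
so $B_{YZ} + B_{YC}B_{CZ}$ collects the effect of $\phi(Z)$ on $Y$ that does not pass through $\psi(X)$.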
% \wk{The $\phi(Z)$ and $\psi(Y)$ here are set to be finite dimensional vector, so that $B_{ZC}$ and $B_{XY}$  are finite dim matrices instead of operators. The proof in draft are all done in linear $Z$ form. This is not a problem, but I wounder setting $B_{ZC}:\H_\Z \to \mathcal{C}$ as and operator, where $\mathcal C = \R$. Then, we need to define zero operator, e.g. $B_{ZC}\phi(z)=0,\forall z$ to amend the proof.}
% \wk{Another thing I wanna suggest is use $B_{CZ}$ instead of $B_{ZC}$, which naturally implying mapping from $Z$-related space to $C$ related space. It provides better intuition for derivations in the proof you will see.}
We study the scenarios in which the operator $B_{YX}$ is identified by the 
KAR
estimator $H^{\gamma}$.
%{\wenqi I'm confused whether we can let $B$ denote operator.}
\begin{theorem}
\label{thm::causal} 
We say that an operator $B_{XZ}$ is the zero operator, written $B_{XZ}=0$, if $\langle \psi(x), B_{XZ}\phi(z)\rangle_{\H_\X}=0$, $\forall \psi(x) \in \H_\X, \phi(z) \in \H_\Z$.
Similarly, $B_{CZ} = 0$ if $c^\T B_{CZ}\phi(z)=0$, $\forall c \in \mathcal C, \phi(z) \in \H_\Z$. A matrix-valued operator, e.g.\ $B_{YC}$, is zero if all its entries are $0$.
% 
For a data generating process following \Cref{eq:nonlinear_sem}, we have $H^\gamma = B_{YX}$ in the following cases.
\begin{itemize}
\vspace{-0.35cm}
    \item [\textup{(i)}] $B_{YC}=0$ and $\gamma = 0$, 
    i.e.
    % which suggests 
    % the case with 
    no 
    % unobserved 
    latent confounder.
    \item [\textup{(ii)}] $B_{YZ} + B_{YC}B_{CZ} = 0$ and $\gamma = \infty$, where kernel IV  is a special case, i.e. both $B_{YZ} = 0$ and  $B_{CZ} = 0$.
    \item [\textup{(iii)}] $B_{YC}=0$, $B_{YZ} + B_{YC}B_{CZ} = 0$, and $\gamma \geq 0$.
%The bias $\Vert H^\gamma - B_{XY} \Vert_2^2$ reaches its minimum in following cases:
    %\item [\textup{(iv)}] $\Sigma_{XY}^\para = a \Sigma_{XY}^\res$ for some $a > 0$, and $\gamma = \infty$.
    \item [\textup{(iv)}] $\Sigma_{YX}^\para = - a \Sigma_{YX}^\res$ for some $a > 0$, and $\gamma = 1/a$.
    
    Here, $\Sigma_{YX}^\para = (B_{YZ} + B_{YC}B_{CZ})\Sigma_Z (B_{ZX} + B_{ZC}B_{CX}) $ denotes the covariance between $\psi(X)$ and $Y$ projected onto the linear span of the components of $\phi(Z)$, and $\Sigma_{YX}^\res = B_{YC}\Sigma_{C}B_{CX}$ denotes the covariance between the residuals of $\psi(X)$ and $Y$.
\vspace{-0.cm}
\end{itemize}
\end{theorem}
Theorem~\ref{thm::causal} (i) 
suggests that KPA (kernel partialling out regression)
is optimal 
when there is no unobserved confounder; (ii) suggests that KIV identifies the causal effect under 
a condition that generalizes the
KIV assumption, i.e.\ $B_{YZ}= 0$ and $B_{CZ} = 0$;
(iii) shows that the KAR estimator identifies the causal relation from $X$ to $Y$ regardless of $\gamma$ when both the generalized KIV condition in (ii) and the no-latent-confounder condition in (i) hold. The condition $\gamma \geq 0$ in (iii) implies that $H^\gamma$ is constant over all $\gamma \geq 0$, which coincides with the definition of anchor stability in \citet{rothenhausler2018anchor}.
% when both assumptions on no unobserved confounding and valid instrumental variable is true, 
% By Thereom~\ref{thm::causal}(iv), if
Theorem~\ref{thm::causal}
(iv) gives the KAR identifiability condition, with an appropriate choice of $\gamma$, when $\Sigma_{YX}^\para$ and $\Sigma_{YX}^\res$ 
point in opposite directions. 
To further illustrate the identifiability condition, consider the linear case and assume that $X$, $Z$ and $C$ have only one dimension. In this case, $\Sigma_{YX}^{\para}$ and $\Sigma_{YX}^{\res}$ are scalars, so $\Sigma_{YX}^{\para} = -a \Sigma_{YX}^{\res}$ holds for some $a > 0$ whenever the two are nonzero and have opposite signs; this is a non-trivial condition under which KAR identifies the causal effect. We stress that the identifiability condition in Theorem~\ref{thm::causal} (iv) requires neither the absence of hidden confounding ($B_{YC} = 0$) nor a valid instrumental variable ($B_{YZ} = 0$ and $B_{CZ} = 0$). 
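As a concrete sketch of case (iv) in this scalar setting, take the illustrative values $B_{YC} = B_{CZ} = B_{ZX} = B_{ZC} = B_{CX} = \Sigma_Z = \Sigma_C = 1$ and $B_{YZ} = -3$; these numbers are assumptions chosen only for this example. Substituting them into the definitions in Theorem~\ref{thm::causal} gives
\begin{align*}
    \Sigma_{YX}^{\para} &= (B_{YZ} + B_{YC}B_{CZ})\,\Sigma_Z\,(B_{ZX} + B_{ZC}B_{CX}) = (-3 + 1)\cdot 1 \cdot (1 + 1) = -4, \\
    \Sigma_{YX}^{\res} &= B_{YC}\,\Sigma_{C}\,B_{CX} = 1,
\end{align*}
so that $\Sigma_{YX}^{\para} = -a\,\Sigma_{YX}^{\res}$ with $a = 4$. Under these assumed values, case (iv) yields $H^\gamma = B_{YX}$ at $\gamma = 1/a = 0.25$, even though $B_{YC} \neq 0$ and $B_{YZ} \neq 0$.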

% Set $ B_{YC} = B_{CZ} = B_{ZX} = B_{ZC} = B_{CX} = \Sigma_Z = \Sigma_C = 1$ and $B_{YZ} = - 3$. We then have $\Sigma_{YX}^{\para} = -4 \Sigma_{YX}^{\res}$. Therefore, $H^\gamma$ with $\gamma = 0.25$ can identify the target causal effect $B_{YX}$.
% Note that by Theorem 2 (iv), we show that anchor regression can identify causal effects without assuming no hidden confounding or valid instrumental variable. For linear anchor regression, think about the case when $X$ has only one dimension, then it's not trivial for $\Sigma_{YX}^{\para} = -a \Sigma_{YX}^{\res}$ holds for some $a > 0$, which is the condition required for identification of causal effect. Again we stress that in Theorem 2 (iv), we do not assume no hidden confounding ($B_{YC} = 0$) or valid instrument variable ($B_{YZ} = B_{YC}B_{CZ} = 0$). 

% The condition requires positive constant $a$ that is reciprocal of $\gamma$.
% KAR estimator with appropriate choice of $\gamma$ performs better than both kernel IV or kernel PA estimator. 
% \wk{what did you mean by perform better?}
In the next section, we show the empirical results for KAR estimators compared with relevant baseline methods.


\begin{figure}[t!]
    \centering
    % \includegraphics[width=0.45\textwidth,height=0.4\textwidth]{fig/Sim_fit_Combined.pdf}
    \includegraphics[width=0.23\textwidth]{fig/Sim_fit1.pdf}\includegraphics[width=0.23\textwidth]{fig/Sim_fit2.pdf}
    
    % \subfigure[MSE]
    {\includegraphics[width=0.42\textwidth
    ,height=0.18\textwidth
    ]{fig/MSE1.pdf}\label{fig:mse_synthetic}} 
        % \vspace{-0.3cm}
    \caption{
    % \small 
    % Fitted models:
    Synthetic example:
    fitted
    (top left) nonlinear models; (top right) linear models;
    % KAR, KAR.2, KIV, KPA, AR, IV and PA estimators.
    (bottom): log MSE.}
    \vspace{-0.15cm}
    \label{fig:kar_fit}
\end{figure}

\section{Empirical Results}\label{sec:simulation}

\subsection{Synthetic experiments}\label{sec:synthetic}
We consider data generated from
the following nonlinear structural equation,
% We conduct simulation to evaluate kernel anchor regression. Set $n_1 = 250$, $n_2 = 250$, $m = 200$, $n=n_1+n_2 = 500$, $\alpha_1 = 1.5n_1^{-0.5}$, $\alpha_2 = 1.5n_2^{-0.5}$, $\alpha = 1.5n^{-0.5}$, and $\xi = 1.5m^{-0.5}$. The structural model is set as follows,
\begin{equation*}
    Y = 0.75C - 0.25Z + \ln(|16X - 8| + 1) \operatorname{sgn}(X - 0.5),
\end{equation*}
where $\operatorname{sgn}(x)\in \{-1, 0, +1\}$ denotes the sign of $x$.
The explanatory variables $X$, $Z$ and $C$ are generated as follows:
\begin{gather*}
\begin{pmatrix}
        C  \\
        V \\
        W \end{pmatrix} 
        \sim
        N \left( 
        \begin{pmatrix}
        0  \\
        0 \\
        0 \end{pmatrix}, 
        \begin{pmatrix}
        1 & 0.3 & 0.2  \\
        0.3 & 1 & 0 \\
        0.2 & 0 & 1 \end{pmatrix} 
        \right),\\
    X = F \left( \frac{W+V}{\sqrt{2}} \right),\quad
    Z = F(W) - 0.5,
\end{gather*}
where $F$ denotes the c.d.f.\ of the standard normal distribution.
{For our learning procedure, $Z$, $X$ and $Y$ are available, and $C$ is unobservable.}
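As a quick check of this construction (a sketch that uses only the covariance matrix stated above), note that $V$ and $W$ are uncorrelated with unit variance, so
\begin{align*}
    \mathrm{Var}\!\left( \tfrac{W+V}{\sqrt{2}} \right) &= \tfrac{1 + 1 + 2 \cdot 0}{2} = 1, \\
    \mathrm{Cov}\!\left( C, \tfrac{W+V}{\sqrt{2}} \right) &= \tfrac{0.2 + 0.3}{\sqrt{2}} \approx 0.354.
\end{align*}
Hence $(W+V)/\sqrt{2}$ is standard normal and, by the probability integral transform, $X \sim \mathrm{Uniform}(0,1)$ and $Z \sim \mathrm{Uniform}(-0.5, 0.5)$. The hidden variable $C$ is correlated with $X$ (through $V$ and $W$) and with $Z$ (through $W$), and it also enters the equation for $Y$, so it acts as a latent confounder.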
%\wx{Comment on which is $Z$ $C$ $X,Y$ are observed or not.}

We generate $\{(x_i,y_i,z_i)\}_{i\in[N]}$ with $N=700$. To perform the data-splitting procedures described in \Cref{sec:kar}, we set $n_1 = n_2 = 250$ and $n=500$ in the projection stage for a fair comparison, and $m=200$ in the regression stage. We set the regularizers to
$\alpha_1 = 1.5n_1^{-0.5}$, $\alpha_2 = 1.5n_2^{-0.5}$, $\alpha = 1.5n^{-0.5}$ and $\xi = 1.5m^{-0.5}$. 
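For concreteness, with the sample sizes above these choices evaluate approximately (rounded to three significant digits) to
\begin{equation*}
    \alpha_1 = \alpha_2 = \frac{1.5}{\sqrt{250}} \approx 0.0949, \qquad
    \alpha = \frac{1.5}{\sqrt{500}} \approx 0.0671, \qquad
    \xi = \frac{1.5}{\sqrt{200}} \approx 0.106.
\end{equation*}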



% We conduct simulation to evaluate kernel anchor regression. 
\paragraph{Fitting methods}
We consider estimation via the three-stage kernel anchor regression with disjoint data set projection (\textbf{KAR}) and the two-stage kernel anchor regression with joint data set projection
(\textbf{KAR.2}). The baseline approaches include the kernel-based nonlinear methods: kernel instrumental variable regression (\textbf{KIV}), kernel partialling out regression (\textbf{KPA}), kernel ridge regression (\textbf{KReg}); and the linear models: linear anchor regression (\textbf{AR}), linear instrumental variable regression (\textbf{IV}), linear partialling out regression (\textbf{PA}) and ordinary least squares (\textbf{OLS}). 
% 
We use a Gaussian kernel for all kernel-based methods, with the bandwidth chosen by the median heuristic \citep{gretton2012kernel}. 
{The median heuristic is a good choice here, as it achieves 
close-to-optimal cross-validation error (see Section B.3 in the supplementary material).}
% \begin{wrapfigure}{r}{20mm}
%     \includegraphics[width=0.134\textwidth]{fig/rebuttal/CV.pdf}
%     \caption{Cross-validation}
%     \label{fig:mnist_real}
% \end{wrapfigure}
% lengthscales are set according to median interpoint distance. 
For the synthetic example, we set $\gamma = 2$ for all anchor regressions (\textbf{KAR}, \textbf{KAR.2} and \textbf{AR}). 
% 


% The fitted model is shown in \Cref{fig:kar_fit}. From the result, we can see that the \textbf{KAR} produce the closest estimation  to the true model among all methods, and outperforms \textbf{KAR.2} in this scenario. The comparison with linear models are also shown. In this case, IV model fitted better than other linear models. 
% estimation result is shown in Figure~\ref{fig::1}.
For each algorithm, we 
run 50 trials 
and calculate the mean squared error (MSE) with respect to the true causal model $\Ex( Y | do(x))$\footnote{The distribution of $y$ obtained by setting $X=x$, while ignoring other variables that may potentially change the distribution of $y$, is denoted $p(y|do(x))$ \citep{pearl2009causality, peters2016causal}. Here $\Ex[Y|do(x)]$ is the mean of $p(y|do(x))$, averaging out the different values of $Z$.}, which can be computed from the structural model. 
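As a sketch of this computation, assume that the intervention $do(x)$ leaves the marginal distributions of $C$ and $Z$ unchanged. Since $\Ex[C] = 0$ and $Z = F(W) - 0.5$ with $F(W)$ uniform on $[0,1]$, we have $\Ex[Z] = 0$, and therefore
\begin{align*}
    \Ex(Y | do(x)) &= 0.75\,\Ex[C] - 0.25\,\Ex[Z] + \ln(|16x - 8| + 1) \operatorname{sgn}(x - 0.5) \\
    &= \ln(|16x - 8| + 1) \operatorname{sgn}(x - 0.5).
\end{align*}
One natural reading of the MSE of a fitted model $\hat f$ (the symbol $\hat f$ is introduced here only for illustration) is then $\int \bigl(\hat f(x) - \Ex(Y | do(x))\bigr)^2 dx$, taken over the range of $X$; the exact evaluation grid is an implementation detail not specified here.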
%{\wenqi Can we put $p(y | do(x))$ here?} \wk{To clarify: let $f(x)$ be the fitted model; so the MSE is by comparing $\int_X(f(x) - y)^2p(y|x)$? If I understand this correctly, I agree with the $do(x)$ notation. } {\wenqi  MSE: $\int_X(f(x) - \Ex(Y|do(x)))^2dx$. }
% (the red line in \Cref{fig:kar_fit}).
One trial is shown in \Cref{fig:kar_fit} as a visual example. We can see that \textbf{KAR} produces the closest estimate to the true model among all methods and outperforms \textbf{KAR.2}. The comparison with linear models is also shown; the \textbf{IV} model fits better than the other linear models.
We report $\log_{10}(\text{MSE})$ in the bottom panel of \Cref{fig:kar_fit}, which shows
that both KAR methods 
have smaller errors than the other methods. \textbf{KAR} performs slightly better than \textbf{KAR.2} in this case. 
To check
the robustness of the KAR estimators, we study a less smooth variant of the data generating process and show the results in Section B.2 in the supplementary material.
% \Cref{app:variant}.




\begin{figure}[t!]
    \centering
    % \vspace{-0.2cm}
\includegraphics[width=0.5\textwidth]{fig/IVcase.pdf}
    % \vspace{-0.3cm}
    \caption{
    % \small 
    Effects of different $\gamma$ choices on MSE.}
    \label{fig:MSE_IV}
    % \vspace{-0.15cm}
\end{figure}
\paragraph{The effect of $\gamma$ choices}



%{\wenqi Well, this figure is for KIV setting. I can include different $\gamma$s for the original case if necessary.}
%\wk{Indeed. that's me making up the story to merge the KIV setting and say KAR can be better. we don't need different $\gamma$ for original case in the main text. prob no space. if result interesting, we can do it in appendix.}
To investigate how the choice of $\gamma$ affects the estimator, 
we take KIV as our baseline, since the IV setting corresponds to $\gamma \to \infty$. We use the data generating process of the KIV paper \citep{singh2019kernel}.
% whose details are included in \Cref{app:kiv}. 
% to show that KAR with appropriate choices of $\gamma$ performs better.
The $\log_{10}$(MSE) results of \textbf{KAR} and \textbf{KAR.2}, in comparison with \textbf{KIV}, are shown in \Cref{fig:MSE_IV}. For the simulation, we set $N=1000$, $n_1 = n_2 = 200$,  $n=n_1+n_2 =400$ and $m = 600$. 
From the result, we can see that both \textbf{KAR} and \textbf{KAR.2} achieve smaller error when choosing $\gamma=2$.
Data generation and model implementation details are included in Section B.1 of the supplementary material.
% \Cref{app:}.



\begin{figure}[t!]
    \centering
        % \vspace{-0.3cm}
    {\includegraphics[width=0.5\textwidth]{fig/Sim_PE_A.pdf}}
    % \vspace{-0.35cm}
    \caption{
    % \small 
    Prediction error with distributional intervention.}\label{fig:PE_synthetic}
    \vspace{-0.5cm}
\end{figure}

\paragraph{Intervention and Generalization}
To evaluate the robustness and generalization performance of both KAR estimators under distribution shift, as discussed in \citet{rothenhausler2018anchor}, we intervene on the anchor variable $Z$. We train the model on a subpopulation of samples with $Z<0$ and test on the samples with $Z \geq 0$. 
The performance is measured by the prediction error (PE) of the fitted model with respect to $\Ex(Y|X=x, Z \geq 0)$, 
%\wk{again to clarify, $f(x)$ is fitted model, PE is $\int_X\int_{z=0}^{0.5} (y - f(x))^2 p(y|x, z)dzdx$, correct?} {\wenqi PE: $\int_X (f(x) - \Ex(Y|X=x, Z \geq 0))^2 p(x|z \geq 0)dx$}, 
where the true conditional model is not known in closed form but is estimated from samples. 
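One way to write this criterion for a fitted model $\hat f$ (a sketch; the symbol $\hat f$ and the functional notation $\mathrm{PE}(\hat f)$ are introduced here for illustration) is
\begin{equation*}
    \mathrm{PE}(\hat f) = \int \bigl( \hat f(x) - \Ex(Y | X = x, Z \geq 0) \bigr)^2 \, p(x | Z \geq 0) \, dx,
\end{equation*}
that is, the squared deviation from the conditional mean on the test subpopulation, weighted by the test distribution of $X$. In practice, such an integral would typically be approximated by an average over the test samples.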

We also swap the training and test sets.
As shown in \Cref{fig:PE_synthetic}, our \textbf{KAR} estimator has the lowest PE among all methods, 
showing better out-of-distribution generalization performance. More importantly, comparing the 
two 
(flipped) scenarios, i.e.\ training on $Z<0$ vs.\ training on $Z\geq 0$, we also see that \textbf{KAR} is the most invariant in terms of PE. In contrast, the linear methods \textbf{AR} and \textbf{IV} achieve very different PE in the two cases. 
The variances of PE for \textbf{KPA} also differ considerably between the two cases. Although \textbf{KReg} achieves a relatively low PE in both cases, its PE distributions differ considerably between them. 
{In the supplementary material (see Section B.1), we also illustrate that KAR can outperform other non-kernel-based approaches, e.g.\ DeepIV or SmoothIV.}
%\wx{Comment on better than Deep IV in this case.}



\subsection{Real-world application}
% For the real data application, 
We consider the smoking dataset extracted from the
National Medical Expenditure Survey (NMES) \citep{johnson2003disease} to study the effect of smoking amount on medical expenditure \citep{imai2004causal}\footnote{The dataset is accessible through the \texttt{R} package for ``estimating causal dose response function'' \texttt{causaldrf} \citep{galagate2016causal} \url{https://cran.r-project.org/web/packages/causaldrf/index.html}.}.

The treatment variable $X$ is the $\log$ of smoking amount, the outcome $Y$ is the $\log$ of medical expenditure, and the anchor $Z$ is set to be the last age for smoking. 
% We conduct all the estimators applied in synthetic experiments. 
We use $1000$ samples, randomly selected from $9708$ available samples, to fit the model. We set 
% sample size
$n_1=n_2=300$, 
% for \textbf{KAR},
$n=600$ 
% for \textbf{KAR.2} 
and $m=400$. 
We also set 
$\gamma = 2.9$ and apply a Gaussian kernel with the median heuristic bandwidth \citep{gretton2012kernel} for all kernel methods. 
As shown in the upper part of Figure~\ref{fig::sm}, the KAR estimators indicate that the effect of $X$ on $Y$ is more pronounced for $X\in [-2,1]$ than for $X\in [1,4]$.
Our method can also complement approaches for finding causal directions, e.g.\ \citet{peters2016causal}\footnote{An implementation in the \texttt{R} package \texttt{CAM} can be found at \url{https://rdrr.io/cran/CAM/man/CAM.html}}. We run CAM to confirm that there is a causal effect in the direction from $X$ to $Y$; the KAR procedure then learns the specific function representing this effect.
%from the smoking amount ($X$) to the medical expenditure ($Y$). 
% We compare our results with existing studies and find 
% The KAR procedure can 
In contrast, existing work such as propensity score approaches \citep{imai2004causal} did not manage to extract such a causal relationship between smoking 
and medical expenditure.
% and did not find a causal relationship between these two variables.
% {\wenqi I'm not sure whether it's safe to use "improve" here.}
% the Causal Additive Model (CAM)  

To strengthen our finding, we quantify the performance of the estimators. Since we do not know the true generating process
of the data, we cannot 
compare the MSE as in \Cref{fig:kar_fit} and \Cref{fig:MSE_IV}.
Instead, we evaluate the estimators' performance under distribution perturbation via PE, similar to \Cref{fig:PE_synthetic}. We train the models on male subjects and compute the prediction error of the fitted models on female subjects. 
As shown in the bottom of Figure~\ref{fig::sm},
both KAR approaches outperform the other kernel-based approaches as well as the linear version of AR, suggesting a better learned causal effect of smoking 
amount on medical expenditure.


% See appendix for specific test settings.


\begin{figure}[t!]
    \centering
    % \vspace{-0.3cm}
    \includegraphics[width=0.4\textwidth
    % , height=0.2\textwidth
    ]{fig/Smoking_fit_AR.pdf}
    % \vspace{-0.1cm}
    \includegraphics[width=0.5\textwidth
    % , height=0.2\textwidth
    ]{fig/Smoking_PE_AR.pdf}
    % \vspace{-0.23cm}
    \caption{
    % \small
    Fitted models (top)
    % of all estimators (first row) 
    and prediction errors (bottom) 
    % of all estimators 
    for training on male subjects and testing on female subjects.
    % (second row).
    }
    \vspace{-0.15cm}
    \label{fig::sm}
\end{figure}

\section{Conclusion and Future Work}\label{sec:conclusion}
In this work, we consider learning a more general class of causal DAGs in a nonlinear setting
using kernelized anchor regression.
By considering different data splitting strategies to estimate the projection operators, 
we show that the three-stage approach not only performs better empirically than the baseline approaches and the 2SLS approach, but also achieves the optimal rate under given conditions. Identifiability results for our approach are provided and are shown to generalize the KIV and ``no latent confounder'' scenarios.
% SEM with nonlinear features.
% study both the three-stage and two-stage approach for estimating KAR. We analyze the convergence properties of KAR estimators and the identifiability condition for the causal DAG. We show improved performance for our KAR estimator and demonstrate its usefulness in the real application.
% 
%\wx

{Our study opens several directions for better understanding nonlinear causal effects within the anchor regression framework. For future work, a data-adaptive choice of $\gamma$ and its causal interpretation remain 
to be explored.
Moreover, while we focus on the effect variable $Y$ in its original space in this work, anchor regression in the feature space of $Y$ is another interesting direction for future study. 
% While the three-stage procedure achieves optimal rate via three regression process, 

We also note that \citet{rothenhausler2018anchor} study the distribution generalization property of linear anchor regression. While this work does not focus on the theoretical properties of generalization using RKHS functions in the non-linear setting, the possibility of, and conditions for, achieving the distribution generalization property with non-linear anchor regression are another interesting future direction.
% . This work focuses on investigating the AR framework using RKHS kernel mean embedding, discussing various methods, deriving their useful properties and reporting the empirical performances in synthetic and real data scenarios. As you also mentioned, distribution generalization under the current framework may even not be possible to achieve and potentially novel framework and analysis may be required to study this in non-linear settings. We left the investigation on distribution generalisation and robustness property as a separate future work. We also appreciate the reference on the K-class estimator approach and will include it in the revision.
}











% \begin{contributions} % will be removed in pdf for initial submission 
% 					  % (without ‘accepted’ option in \documentclass)
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions. 
%     This is a nice way of making clear who did what and to give proper credit.
%     This section is optional.

%     H.~Q.~Bovik conceived the idea and wrote the paper.
%     Coauthor One created the code.
%     Coauthor Two created the figures.
% \end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
The authors thank Ana Korba and Arthur Gretton for helpful discussions. 
% W.S. is supported by . 
W.X. acknowledges the support from EPSRC grant  EP/T018445/1.
\end{acknowledgements}

% References
% \bibliography{uai2023-template}
\bibliography{shi_139}


\end{document}
