\documentclass{uai2024} 
\usepackage{amsfonts} 
\usepackage{hyperref}
\usepackage{xcolor}
\newcommand{\jin}[1]{\textcolor{blue}{#1}}
\newcommand{\yuta}[1]{\textcolor{red}{#1}}

\begin{document}



Thank you for your constructive comments and suggestions, which will help us improve the paper. We will carefully incorporate them into the revised version. Below, each of your comments is stated first, followed by our response.

>Comment:
More background on reproducing kernel Hilbert space (RKHS) will be very helpful to those not familiar with these math concepts.\\
The theorems (Thm 4.1, 4.2, 4.3, 4.4, 4.5) on the statistical properties of estimators contain many assumptions. While most of these assumptions were already known from [Newey and Powell, 2003], [Ai and Chen 2003], adding more intuitions/examples can help illustrate the extent of their restrictiveness.

Our response:
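Thank you for the feedback. In the revision, we will add more background on RKHS, as well as more intuition and examples for the assumptions in Theorems 4.1--4.5.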

>Comment: I think it'll be more convincing if the paper can include a concrete example where taking the derivative of $E[Y_x|w]$ computed from (Newey and Powell 2003) yields an incorrect answer when the weaker separability assumption (Assumption 3.2) holds but the previous separability assumption (Newey and Powell 2003) does not.

Our response:
We believe the examples and experiments presented in Section 5 serve this purpose. The two SCMs (A) and (B) in Eq. (27) satisfy the weaker separability assumption (Assumption 3.2) but not the separability assumption (Eq. (2)) required by the previous work. The results in Table 1 show that the estimated
coefficients of P-CAPCE approach the true values
when the sample size reaches $N = 10000$, while the coefficient for $W$ estimated by the previous method PTSLS remains biased.
The results in Table 2 show that the MSEs of the previous methods (PTSLS, NTSLS, Kernel IV) are larger than those of our corresponding methods.

We further performed experiments in settings where the strong separability assumption holds (discussed in the last paragraph of Section 5). The results (presented in Appendix G) show that the performance
of the existing methods
is comparable to that of our proposed methods in this setting.

>Comment:
Are there any restrictions on the functions in the parametric CAPCE estimator? For example, do they need to be finite?

Our response:
As in standard regression, we choose a set of linearly independent functions.
The number of functions is assumed to be finite.
To use an infinite number of functions, one should resort to the Sieve CAPCE estimator.
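
As a hypothetical illustration (the symbol $J$ and the specific basis below are illustrative assumptions for this response, not taken from the paper), a model built from finitely many functions $\varphi_j$ with parameters $\beta_j$ could take the form
\[
\sum_{j=1}^{J} \beta_j \varphi_j(x,w), \qquad J < \infty .
\]
For scalar $x$ and $w$, the set $\{1, x, w, xw\}$ is linearly independent, whereas $\{x, w, x+w\}$ is not, because its third element is the sum of the first two.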


>Comment: 
For RKHS CAPCE estimator, are 
$\lambda_1\|G_1\|^2$, $\lambda_2\|G_2\|^2$, ... in Eq. (20), (21), (22) regularization terms? Any reason why these specific terms are considered?

Our response:
Yes, $\lambda_1\|G_1\|^2$, $\lambda_2\|G_2\|^2$, ... in Eq. (20), (21), (22) are regularization terms.
These terms penalize the $L_2$ norm of the models and are the standard regularization used in kernel ridge regression.
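
For illustration, a generic kernel ridge regression objective can be written as follows, where $G$, $\mathcal{H}$, $\lambda$, $(x_i, y_i)$, and $N$ are generic placeholders rather than the exact notation of Eq. (20), (21), (22):
\[
\hat{G} = \arg\min_{G \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - G(x_i)\bigr)^2 + \lambda \|G\|^2 ,
\]
with $\lambda > 0$ controlling the trade-off between data fit and the size of $G$.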


>Comment: 
Before spelling out the details of two stages of Sieve CAPCE estimator, it would be helpful to mention the high-level idea of each stage.

Our response:
Thank you for the feedback.
We will add explanations along the lines of the following:
``In Stage 1, we learn the models $\hat{E}[Y|Z=z]$ and
$\hat{E}[\varphi_j(X,W)|Z=z]$ from the data by regression. Then, in Stage 2, we estimate the parameters ${\boldsymbol \beta}$ by solving Eq. (7).''

>Comment: 
Right above equation (5), "$\kappa>(1+d)/2$, and define regularized Sobolev norm..." I don't think $d$ is defined here. Is $d=|w|$?

Our response:
Yes, $d=|w|$. The paper states that $W$ is $d$-dimensional right after Eq. (1); we will restate $d=|w|$ just before Eq. (5).

\end{document}


%If we use the infinite numbers of functions for the model, we should S-CAPCE estimator.