\documentclass{article}

\usepackage{neurips_2023}
\usepackage{amsfonts} 
\usepackage{xcolor}
\newcommand{\jin}[1]{\textcolor{blue}{#1}}
\newcommand{\yuta}[1]{\textcolor{red}{#1}}

\begin{document}


Thank you for your constructive comments and suggestions. They are very helpful for us to improve our paper. We will carefully incorporate them in the revised paper. In the following, your comments or questions are first stated and then followed by our responses.

Comment 1:

My major concern is the novelty of this paper. In my opinion, the identification formula is the most significant result since given that integral equation, the sieve, parametric, and RKHS estimators are all pretty standard. However, the identification formula seems a quite straightforward extension of that in the existing literature about average partial effects, e.g., Wong (2022) and Kawakami et al. (2023). Note that Kawakami et al. (2023) is very closely related, but it is not cited in this paper.

Our response:

Thank you for pointing out Kawakami et al. (2023). That paper appeared at this year's ICML and was not available when this work was submitted to NeurIPS. We will discuss its relation to our work in the revised paper.

Below, we discuss the novel contributions of this paper relative to existing work.

The first main contribution of this work is a novel condition for identifying CAPCE. Importantly, we show that CAPCE is identifiable under a weaker separability assumption (Assumption 3.2):
$f_Y(X,W,H,u_Y)
=f_{Y_1}(X,W,u_Y)+f_{Y_2}(W,H,u_Y).$
Particularly when there are many covariates $W$, Assumption 3.2 is less restrictive than the standard separability assumption in Eq. (2),
$f_Y(X,W,H,u_Y)=f_{Y_1}(X,W,u_Y)+f_{Y_2}(H, u_Y)$,
which is required by many existing works (Newey and Powell, 2003; Wooldridge, 2010; Singh et al., 2019, etc.). Targeting CAPCE $\mathbb{E}[\partial_x Y_x|w]$ rather than $\mathbb{E}[Y_x|w]$ is what allows the separability assumption to be weakened; in other words, CAPCE is identifiable under a weaker assumption than $\mathbb{E}[Y_x|w]$. Our work therefore shows the importance and merit of studying CAPCE instead of $\mathbb{E}[Y_x|w]$, which has been the focus of most existing works. This point about the weaker separability assumption does not arise in Wong (2022) or Kawakami et al. (2023), because they study the setting without the covariates $W$.
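
As a brief sketch of the reasoning (assuming that $f_{Y_1}$ is differentiable in $x$ and that $W$ and $H$ are not affected by $X$, so that the potential outcome is $Y_x=f_{Y_1}(x,W,u_Y)+f_{Y_2}(W,H,u_Y)$ under Assumption 3.2), differentiating with respect to $x$ gives
$$
\partial_x Y_x=\partial_x f_{Y_1}(x,W,u_Y)+\partial_x f_{Y_2}(W,H,u_Y)=\partial_x f_{Y_1}(x,W,u_Y).
$$
The term $f_{Y_2}(W,H,u_Y)$, which may contain interactions between $W$ and $H$, does not depend on $x$ and therefore drops out of CAPCE, whereas it remains in $\mathbb{E}[Y_x|w]$.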

The second main contribution of this work is that we develop parametric, sieve, and RKHS families of estimation methods for CAPCE. We acknowledge that these families of estimators have been well studied for $\mathbb{E}[Y_x|w]$; still, the proposed identification method for CAPCE is new, and the derivation and analysis of the corresponding estimators are necessary and nontrivial.
In comparison with the estimation methods of Kawakami et al. (2023), this work investigates CAPCE $\mathbb{E}[\partial_x Y_x|w]$, which generalizes the APCE $\mathbb{E}[\partial_x Y_x]$ studied by Kawakami et al. (2023) in order to represent heterogeneous causal effects. Kawakami et al. (2023) presented a parametric method and Picard iteration-based estimators. The sieve and RKHS estimators developed in this paper are not provided in Kawakami et al. (2023).
We also note that the Picard iteration-based estimator is not suitable for solving the integral equation in Eq. (3), because its integral kernel involves a density function rather than a CDF.

We hope the discussion above addresses your concern about the novelty of this paper. Whereas most existing works focus on $\mathbb{E}[Y_x|w]$, this work develops identification and estimation methods for CAPCE $\mathbb{E}[\partial_x Y_x|w]$. It extends and complements the results of Wong (2022) and Kawakami et al. (2023) for $\mathbb{E}[\partial_x Y_x]$, for which sieve and RKHS estimators were not developed.

Comment 2:

Additionally, section 4 of this paper appears a bit redundant. The parametric setting in section 4.2 can follow very straightforwardly from the sieve setting in section 4.1. So I think section 4.2 can be shortened substantially, maybe to a couple of paragraphs as an extra remark at the end of section 4.1. The paper can use the space saved from shortening section 4.2 to better present other parts. In particular, the paper may describe the smoothness condition and the concrete convergence rate more explicitly, and also highlight the effect of the degree of ill-posedness. Currently, these important details are pretty much all hidden.

Our response:

Thank you for your advice. We will carefully explore the options for shortening  Section 4.2 and improving the presentation.

Question 1:

The proposed method requires a reference point $z_0$. How may the choice of $z_0$ affect the performance of the proposed method?

Our response:

The choice of the reference point $z_0$ does not affect the consistency results or the rate of convergence, but it may affect the variance of the estimator. In our experiments, we take the minimum value of $Z$ as the reference point $z_0$. In those experiments, the choice of $z_0$ had little effect on the standard deviation (SD) of the estimators.

Question 2:

On line 139, it is stated that $\varphi_j$ is the anti-derivative of $\phi_j$.
How is this anti-derivative obtained in practical implementation?
Can the authors clarify this? 
Also, the definition $\varphi_j(x,w)=\int \phi_j(x,w)dx$ does not seem right. 
Does it mean $\varphi_j(x,w)=\int_{-\infty}^x \phi_j(x',w)dx'$?

Our response:

In the implementation, the antiderivatives of widely used basis functions such as Hermite polynomials are easy to obtain in closed form. For example, the antiderivative of the degree-3 (physicists') Hermite polynomial $8x^3-12x$ is
$$
\int (8x^3-12x)dx =2x^4-6x^2+C.
$$

In the paper, we abbreviate the antiderivative $\varphi_j(x,w)=\int_{-\infty}^x \phi_j(x',w)dx'$ as $\varphi_j(x,w)=\int \phi_j(x,w)dx$,
because we only use differences between antiderivatives, so the constant of integration is irrelevant.
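
As a quick check with the Hermite example above (writing $\varphi(x)=2x^4-6x^2+C$; the evaluation points $a$ and $b$ are placeholders introduced here for illustration only), the constant cancels in the difference:
$$
\varphi(b)-\varphi(a)=(2b^4-6b^2+C)-(2a^4-6a^2+C)=2(b^4-a^4)-6(b^2-a^2)=\int_a^b (8x^3-12x)\,dx.
$$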

Question 3:

Moreover, RKHS should be quite similar to the sieve approach. How come there is no analog of the anti-derivative in the RKHS approach? In particular, according to line 232, the conditional expectation of $\pi(X,W)$ needs to be estimated. Why not the conditional expectation of the anti-derivative of $\pi(X,W)$?

Our response:

The feature map $\pi(x,w)$ is indeed an antiderivative function, and $k_{X,W}$ in Eq. (25) is an antiderivative kernel function. The details are in Appendix A.2, where
$\pi(x,w)$ is expressed as an antiderivative function in line 48. We apologize for the confusion. We will change the sentence in line 226 of the paper
 
``Denote the feature map
$\pi: \Omega_{X,W}  \rightarrow H_{X,W}, (x,w) \mapsto k_{X,W}(x,w,\cdot,\cdot)$ and $\psi: \Omega_Z  \rightarrow H_Z, z \mapsto k_Z(z,\cdot).$''

to the following:

``Denote the feature map
$\eta: \Omega_{X,W}  \rightarrow H_{X,W}, (x,w) \mapsto k'_{X,W}(x,w,\cdot,\cdot)$ and $\psi: \Omega_Z  \rightarrow H_Z, z \mapsto k_Z(z,\cdot)$. 
In addition, we denote the antiderivative feature function $\pi(x,w)=-\int_{-\infty}^x \eta(x',w) dx'$ and the antiderivative kernel function $k_{X,W}(x,w,x',w')= \int \int k'_{X,W}(x,w,x',w')dxdx'$.''

Note that the antiderivative kernel function can be computed explicitly by taking the antiderivative of the kernel function, which is justified by Fubini's theorem:
$$
\langle\pi(x,w),\pi(x',w')\rangle=\int\int\langle\eta(x,w),\eta(x',w')\rangle\,dx\,dx'.
$$
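
To spell out this step, here is a sketch that assumes the integrals exist and that the inner product can be exchanged with integration; the dummy variables $s$ and $t$ are introduced here for clarity. The two minus signs in the definition of $\pi$ cancel, and $\langle\eta(s,w),\eta(t,w')\rangle=k'_{X,W}(s,w,t,w')$ because $\eta$ is the feature map of $k'_{X,W}$:
$$
\langle\pi(x,w),\pi(x',w')\rangle
=\Big\langle -\int_{-\infty}^{x}\eta(s,w)\,ds,\; -\int_{-\infty}^{x'}\eta(t,w')\,dt\Big\rangle
=\int_{-\infty}^{x}\int_{-\infty}^{x'}k'_{X,W}(s,w,t,w')\,dt\,ds.
$$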

Appendix A.2 provides a detailed derivation of the RKHS CAPCE estimator, which we hope addresses your concern about the RKHS estimator.

Question 4:

In Section 5 figure 2(a), somehow the PTSLS method is severely biased even for $N=10000$. But the PTSLS parametric model is actually well-specified so the observed high bias is a purely finite sample phenomenon, right? It is somewhat surprising that the bias of PTSLS is so high.

Our response:

No, the high bias of PTSLS is not a finite-sample phenomenon. It arises because the experimental setting in Eq. (28 A) violates the stronger separability assumption in Eq. (2) required by the PTSLS estimator, owing to the interaction terms between the covariate $W$ and the unobserved confounder $U$. The bias goes away in experiments where these interaction terms are removed.
These results demonstrate the practical usefulness of the proposed methods: they directly estimate CAPCE $\mathbb{E}[\partial_x Y_x|w]$, which is identified under a weaker separability assumption than $\mathbb{E}[Y_x|w]$, the target of existing work including PTSLS.
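
As a purely hypothetical illustration (this toy model is introduced here and is not the experimental design in Eq. (28 A)), write the unobserved confounder as $H$, following the notation of Assumption 3.2, and suppose
$$
f_Y(X,W,H,u_Y)=XW+WH+u_Y.
$$
The interaction $WH$ cannot be absorbed into a function of $(H,u_Y)$ alone, so Eq. (2) fails, whereas Assumption 3.2 holds with $f_{Y_1}(X,W,u_Y)=XW$ and $f_{Y_2}(W,H,u_Y)=WH+u_Y$. In this toy model $\partial_x Y_x=W$, so $\mathbb{E}[\partial_x Y_x|w]=w$ regardless of the interaction term.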

Comment: 

One additional minor comment: In $\mu(z)$ on line 109, $z$ and $z_0$ should be switched.

Our response: 

Thank you for the comment.
However, $\mu(z)$ on line 109 is correct as written.
Written out in full, it states
$$
\mathbb{E}[Y|Z=z]-\mathbb{E}[Y|Z=z_0]=-\int_{\Omega_{W}}\int_{\Omega_X} \{\mathfrak{p}(X\leq x,{W}={w}|Z=z)-\mathfrak{p}(X\leq x,{W}={w}|Z=z_0)\}\mathbb{E}[\partial_x Y_{x}|{w}] \,dx\,d{w}.
$$



\end{document}