\documentclass{article}

\usepackage{neurips_2023}
\usepackage{amsfonts} 
\usepackage{xcolor}
\newcommand{\jin}[1]{\textcolor{blue}{#1}}
\newcommand{\yuta}[1]{\textcolor{red}{#1}}





\begin{document}


Thank you for your constructive comments and suggestions, which are very helpful for improving our paper. We will carefully incorporate them in the revised paper. Below, each of your comments is stated first, followed by our response.

Comment 1:

< My understanding is that the cited work [1] also handles the estimation of heterogeneous treatment effect with continuous treatments, without requiring the separability assumption. Subsequently, there have been several follow-ups [2,3,4] that also do not require separability and can handle heterogeneity by additionally conditioning their moment equations on the covariates. I wonder if the authors can provide clarification about their contributions from this perspective. I do see that the proposed method is different than these prior works, but a comparison in terms of both theory and empirics might be more useful.

[1] Syrgkanis, Vasilis, Victor Lei, Miruna Oprescu, Maggie Hei, Keith Battocchi, and Greg Lewis. "Machine learning estimation of heterogeneous treatment effects with instruments." Advances in Neural Information Processing Systems 32 (2019).

[2] Bennett, Andrew, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, and Masatoshi Uehara. "Minimax Instrumental Variable Regression and Convergence Guarantees without Identification or Closedness." arXiv preprint arXiv:2302.05404 (2023).

[3] Dikkala, Nishanth, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. "Minimax estimation of conditional moment models." Advances in Neural Information Processing Systems 33 (2020).

[4] Muandet, Krikamol, Arash Mehrjou, Si Kai Lee, and Anant Raj. "Dual instrumental variable regression." Advances in Neural Information Processing Systems 33 (2020).

Our response:

To our understanding, the works [1, 2, 3, 4] all make the separability assumption (to achieve identifiability). In this paper's notation, they roughly assume $Y = f_1(X, W) + f_2(H)$ with $E[f_2(H)|Z]=0$; please see the conditions of Eq. (1) in each of [1], [2], [3], and [4]. That is, they all assume that the errors/confounders $H$ are separable from both the treatment $X$ and the covariates $W$, and they mostly focus on the efficiency of estimators for simple additive error functions.
In contrast, this work makes a weaker separability assumption, $Y=f_1(X, W) + f_2(W, H)$, which allows complex non-additive error functions. One contribution of this work is to show that separability with respect to the covariates is not needed if we are interested only in CAPCE.

Comment 2:

< Def 1: Is there a specific reason to push the derivative inside the expectation? I wonder if there are settings where the expected value is differentiable but the $Y_x(U)$ is not. (e.g., when outcomes are binary, which is violated in Assumption 3.1(2))?

Our response:

By the dominated convergence theorem, we can interchange differentiation and expectation and obtain $\mathbb{E}[\partial_x Y_x|w]=\partial_x\mathbb{E}[Y_x|w]$, since we assume that $Y_x$ is differentiable and bounded.
Thus, this paper does not deal with scenarios where $\mathbb{E}[\partial_x Y_x|w]\ne\partial_x\mathbb{E}[Y_x|w]$, such as non-differentiable $Y_x$.
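
As a simple illustration of such an excluded scenario (a toy example of our own for this response, not taken from the paper), consider a binary outcome $Y_x=\mathbf{1}\{x>U\}$ with $U\sim\mathrm{Uniform}(0,1)$ independent of $W$. For $x\in(0,1)$,
$$
\partial_x\mathbb{E}[Y_x|w]=\partial_x\Pr(U<x)=\partial_x x=1,
\qquad
\mathbb{E}[\partial_x Y_x|w]=0,
$$
because $Y_x$ is piecewise constant in $x$, so $\partial_x Y_x=0$ wherever the derivative exists (that is, for all $x\ne U$). The expected value is differentiable here, but the two quantities differ, which is why Assumption 3.1(2) rules out such outcomes.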

Comment 3:

< For Theorem 1, it might be beneficial to provide additional discussion on intuitively what allows the weakening of the separability assumption compared to the works cited therein.

Our response:

Thank you for your advice.
We will provide additional discussion of Theorem 1.
Essentially, the weakening of the separability assumption is possible because we estimate the CAPCE $\mathbb{E}[\partial_x Y_x|w]$ rather than $\mathbb{E}[Y_x|w]$, which is the focus of the existing works.
Technically, from Assumption 3.2,
$$
f_Y(x,W,H,u_Y)
=f_{Y_1}(x,W,u_Y)+f_{Y_2}(W,H,u_Y),
$$
differentiating both sides with respect to $x$, we have
$$
\partial_xf_Y(x,W,H,u_Y)
=\partial_xf_{Y_1}(x,W,u_Y),
$$
and this eliminates the unmeasured confounding bias, because the term $f_{Y_2}$ involving $H$ does not depend on $x$.
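
As a sketch of the remaining step (writing the potential outcome as $Y_x=f_Y(x,W,H,U_Y)$, which is our shorthand in this response and an assumption about the notation), taking the conditional expectation given $W=w$ yields
$$
\mathbb{E}[\partial_x Y_x|w]
=\mathbb{E}[\partial_x f_{Y_1}(x,w,U_Y)|w].
$$
The CAPCE therefore depends on $f_{Y_1}$ only, while $f_{Y_2}(W,H,u_Y)$ may be an arbitrary, possibly non-additive, function of the covariates and the unmeasured confounders.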

Comment 4:

< Theorem 4.2 and 4.4, I am wondering if authors have considered using Neyman orthogonality to obtain better rates?

Our response: 

Thank you for your advice. To our understanding,
Neyman orthogonality would require a much stronger functional assumption than our separability Assumption 3.2, e.g., the assumption in Eq. (1) of [1]. Investigating this is an interesting direction for future work.

Comment 5:

< It might be helpful to disentangle where the benefits are coming from in Tables 1 and 2. Do the baselines also use a similar model-selection procedure? Conceptually, the main advantage of the proposed method is that it can better model settings where covariates and unobserved confounders can non-linearly impact the outcome.
When such an interaction is absent, how does the proposed method compare to existing methods? Having an ablation where the degree of this interaction is controlled and performance is compared with the baselines, that would be helpful.

Our response:

The baselines use a model-selection procedure similar to that of our proposed methods. Thank you for your advice on the experiments. We have performed experiments in which the interaction between the covariates and unobserved confounders is absent. The results, shown below, confirm that the existing method PTSLS works well in this setting.
We are also running experiments in which the degree of interaction between the covariates and unobserved confounders is controlled, and we will provide the results in the revised paper.

Here we give an additional experiment comparing the P-CAPCE and PTSLS estimators when the interaction between the covariates and unobserved confounders is absent.
We change SCM (A), $Y:=10X^2+WX+X+W+50(W^5+W^4+W^3+W^2)U+E_3$ in Eq. (28), to the following ``no interaction'' setting, $Y:=10X^2+WX+X+W+U+E_3$, while keeping all other settings the same.
The mean and standard deviation (SD) of the P-CAPCE and PTSLS estimators are shown in the tables below. The results show that both estimators work well in this setting. This indicates that the advantage of the proposed P-CAPCE over the existing PTSLS displayed in Tables 1 and 2 of the paper stems from the fact that P-CAPCE allows interaction between the covariates and unobserved confounders, whereas PTSLS can be severely biased in that case.

P-CAPCE Estimator
|N=100|1|W|X|N=1000|1|W|X|
|-|-|-|-|-|-|-|-|
|True Coeff.|1|1|20|True Coeff.|1|1|20|
|Mean|0.944|1.151|19.642|Mean|0.999|0.966|19.998|
|SD|0.811|5.535|4.884|SD|0.106|4.851|0.305|


PTSLS
|N=100|1|W|X|N=1000|1|W|X|
|-|-|-|-|-|-|-|-|
|True Coeff.|1|1|20|True Coeff.|1|1|20|
|Mean|1.029|0.997|19.474|Mean|1.003|0.939|19.939|
|SD|0.155|1.609|0.782|SD|0.028|0.814|0.118|
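
As a quick check on the ``True Coeff.'' rows (a worked step, assuming the columns $1$, $W$, $X$ denote the coefficients of the partial derivative of the outcome with respect to $X$), differentiating the no-interaction outcome equation gives
$$
\partial_X\left(10X^2+WX+X+W+U+E_3\right)=20X+W+1,
$$
i.e., coefficients $(1,1,20)$ on $(1,W,X)$. The same derivative is obtained for the original SCM (A), because the term $50(W^5+W^4+W^3+W^2)U$ does not involve $X$.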

Comment 6:

< Adding baselines from point 1 discussed above would also be useful.

Our response:

Thank you for your advice on the experiments. We will evaluate the suitability of comparing with [1, 2, 3, 4]; the problem settings and focus appear to be somewhat different. For example, [1] assumes that the causal effect is linear, while [2, 3, 4] focus on the efficiency of the estimators and consider simple additive error functions.






\end{document}