\documentclass[
  journal=proceedings,
  manuscript=article-type,
  year=2024
]{PMET_proc}

\usepackage{amsmath}
\usepackage[nopatch]{microtype}
\usepackage{booktabs}
\usepackage{orcidlink}
\usepackage{graphicx}
\usepackage{nicematrix}

\title{Using Nonparametric Regression Trees to Estimate Different Forms of Heterogeneous Treatment Effects}
\author{G. Buhrman \orcidlink{0000-0003-4125-7171}}
\affiliation{Department of Educational Psychology, University of Wisconsin, Madison, Wisconsin, USA}
\email[G. Buhrman]{buhrman@wisc.edu}

\author{X. Liao \orcidlink{0000-0001-8160-3660}}
\affiliation{Educational and Counseling Psychology, and Special Education, University of British Columbia, Vancouver, Canada}

\author{J.-S. Kim \orcidlink{0000-0002-3392-3675}}
\affiliation{Department of Educational Psychology, University of Wisconsin, Madison, Wisconsin, USA}

\addbibresource{references.bib}

\keywords{nonparametric regression, causal inference, heterogeneous treatment effects, Bayesian additive regression trees}

\begin{document}

\begin{abstract}
Interest in heterogeneous treatment effects has substantially increased in recent years. Treatment heterogeneity describes the case when individuals are differentially affected by an intervention or exposure according to their characteristics, and accurate estimation of these differential effects can support cost-effectiveness evaluations of interventions and inform policy decisions about which individuals or groups will benefit most from an intervention. However, the functional form of heterogeneous treatment effects can vary and is typically unknown to researchers. For instance, the effect of math tutoring on students’ test scores might vary across students’ prior math scores as a negative quadratic function, meaning that the students who benefit most have neither particularly high nor particularly low prior scores. Such “Goldilocks” effects and other complex treatment functions have motivated the use of nonparametric regression techniques that make few or no assumptions about the true data generating model. While previous studies have proposed and compared the performance of different nonparametric methods across different datasets, few studies have explicitly explored how the complexity of the functional form of the heterogeneous treatment effects impacts the performance of nonparametric regression tree methods. We initially set out to explore how the monotonicity of the treatment effect function impacted performance, but present findings that pertain to the overall complexity of the treatment effect function. In these proceedings, we 1) explain why complexity of the treatment effect function is a relevant and important consideration and 2) provide results from a preliminary simulation study that examines how variation in the functional form of treatment effects impacts the accuracy of popular nonparametric regression tree approaches in the context of clustered data with a non-random treatment assignment.
We conclude with a discussion of the limitations of the study and possible avenues for future research. Our results suggest that functional complexity, rather than monotonicity, plays a more critical role in the accuracy of nonparametric treatment effect estimators.
\end{abstract}

\section{Introduction}

The fields of social and behavioral sciences are in the midst of a heterogeneity revolution. Initiated by a combination of factors including the replicability crisis, disproportionate attention towards main effects, and the lack of attention towards generalizability, this shift in the field of behavioral sciences has motivated research in recent years to increasingly acknowledge and discuss heterogeneous treatment effects \autocite{bryan_2021, hallsworth_manifesto_2023, yeager_teacher_2022, szaszi_no_2022, walton_where_2023, cikara_hate_2022}. Beyond the revolution, heterogeneous treatment effects are also important to consider when evaluating the cost-effectiveness of interventions or when making policy decisions about which units should receive an intervention.

Heterogeneous treatment effects can be represented as a function of the observed characteristics of individuals and groups in the sample, but this functional form can take a variety of shapes and is usually unknown to researchers. In order to best identify who will benefit most from an intervention, and under what conditions, accurate estimation of this functional form is necessary. Previous research has shown that these functional forms can be accurately estimated using nonparametric and machine learning-based methods, but there has been little attention given to how different approaches' results might vary depending on the functional form of the heterogeneous treatment effect. In these proceedings, we attempt to shed light on this area of research by demonstrating the situations where different nonparametric methods return similar or different results depending on the functional form of the heterogeneous treatment effect. We do this by conducting a simulation study that uses semi-synthetic data generation to produce clustered data with a non-random treatment assignment.

\subsection{Potential Outcomes Framework in Clustered Data}

We use the Neyman-Rubin potential outcomes framework for causal inference in this study \autocite{neyman_1923, rubin_estimating_1974}, with the extended notation of potential outcomes for the multilevel structure where units are nested within clusters \autocite{hong_2006, lyu_estimating_2023}. Assume that there are $N$ individuals nested within $M$ clusters. Let $Y_{ij}(1)$ denote the potential outcome if individual $i$ within cluster $j$ received treatment $(T_{ij}=1)$ and $Y_{ij}(0)$ denote the potential outcome if individual $i$ in cluster $j$ did not receive treatment $(T_{ij}=0)$, where $i=1,\ldots,n_j$ in cluster $j=1,\ldots,M$ and $\sum_{j=1}^M n_j=N$. Under this framework, the observed outcome can be expressed as

\begin{equation}
\begin{aligned}\label{eq:first}
    Y_{ij}=T_{ij}Y_{ij}(1)+(1-T_{ij})Y_{ij}(0),
\end{aligned}
\end{equation}

\noindent under the \textit{stable unit treatment value assumption}, or SUTVA \autocite{rubin_comment_1986}. SUTVA states that the potential outcomes of individuals are not affected by the treatment assignments of other individuals, and that there are no hidden versions of the treatment. Hong and Raudenbush, and Imbens and Rubin, provide more detail about SUTVA in multilevel settings \autocite{hong_2006, hong_heterogeneous_2013, imbens_causal_2015}. The two potential outcomes $Y_{ij}(1)$ and $Y_{ij}(0)$ can never be observed at the same time for the same individual, meaning that individual treatment effects cannot be calculated outright. However, under certain key assumptions we can identify the average treatment effect $\tau$, defined as

\begin{equation}
\begin{aligned}\label{eq:second}
    \tau = \text{E}[Y_{ij}(1) - Y_{ij}(0)].
\end{aligned}
\end{equation}

\noindent The first assumption is that potential outcomes are independent of treatment assignment $T_{ij}$. This can be achieved via random treatment assignment or by establishing unconfoundedness if treatment assignment was non-random. Often referred to as \textit{(conditional) ignorability} \autocite{rosenbaum_central_1983, rubin_1978}, this assumption states that potential outcomes are independent of treatment assignment, conditional on individual and cluster covariates, $\mathbf{X_{ij}}$ and $\mathbf{Z_j}$, respectively:

\begin{equation}
\begin{aligned}\label{eq:third}
    \text{Unconfoundedness: } \{Y_{ij}(1),Y_{ij}(0)\} \perp T_{ij} \ | \ \mathbf{X_{ij}}, \mathbf{Z_{j}}.
\end{aligned}
\end{equation}

\noindent The second assumption necessary for valid causal inference is that the probability that an individual is assigned to either treatment condition, given their underlying characteristics or covariates, is strictly between 0 and 1:

\begin{equation}
\begin{aligned}\label{eq:fourth}
    \text{Positivity: } 0 <e(\mathbf{X_{ij}},\mathbf{Z_{j}})=\text{Pr}(T_{ij}=1 \ | \mathbf{X_{ij}}, \mathbf{Z_{j}}) <1,
\end{aligned}
\end{equation}

\noindent where $e(\mathbf{X_{ij}}, \mathbf{Z_{j}})$ is the \textit{propensity score} \autocite{rosenbaum_central_1983}. 

\subsection{Heterogeneous Treatment Effects and CATE Estimation}

Conditional average treatment effects, or CATEs, are a type of treatment effect that can be estimated to quantify treatment heterogeneity \autocite{imbens_causal_2015}. If we believe that a treatment effect may vary across the observed individual-level $(\mathbf{X_{ij}})$ and cluster-level $(\mathbf{Z_{j}})$ covariates, then the CATEs can be defined as

\begin{equation}
\begin{aligned}\label{eq:fifth}
    \tau_{ij} = \text{E}[Y_{ij}(1) - Y_{ij}(0) \ | \ \mathbf{X_{ij}}, \mathbf{Z_{j}}].
\end{aligned}
\end{equation}

\noindent If we could estimate the treatment effect for each individual $i$ within cluster $j$, conditional upon the individual-level and cluster-level covariates, we would have a vector of individual treatment effects (ITEs), denoted as $\tau_{ij}$. These ITEs can be analyzed and visually inspected to determine whether there are heterogeneous treatment effects.
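
As a brief sketch of why the assumptions above matter (a standard identification argument, not a result specific to this study), the CATE can be rewritten in terms of observable quantities. The lowercase symbols $\mathbf{x}$ and $\mathbf{z}$ are introduced here only to denote fixed covariate values:

\begin{equation*}
\begin{aligned}
    &\text{E}[Y_{ij}(1) - Y_{ij}(0) \ | \ \mathbf{X_{ij}}=\mathbf{x}, \mathbf{Z_{j}}=\mathbf{z}] \\
    &\quad = \text{E}[Y_{ij}(1) \ | \ T_{ij}=1, \mathbf{X_{ij}}=\mathbf{x}, \mathbf{Z_{j}}=\mathbf{z}] - \text{E}[Y_{ij}(0) \ | \ T_{ij}=0, \mathbf{X_{ij}}=\mathbf{x}, \mathbf{Z_{j}}=\mathbf{z}] \\
    &\quad = \text{E}[Y_{ij} \ | \ T_{ij}=1, \mathbf{X_{ij}}=\mathbf{x}, \mathbf{Z_{j}}=\mathbf{z}] - \text{E}[Y_{ij} \ | \ T_{ij}=0, \mathbf{X_{ij}}=\mathbf{x}, \mathbf{Z_{j}}=\mathbf{z}].
\end{aligned}
\end{equation*}

\noindent The first equality uses unconfoundedness, the second uses the expression for the observed outcome under SUTVA, and positivity ensures that both conditional expectations are defined at $(\mathbf{x}, \mathbf{z})$.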

\subsection{Motivating Problem}

In a previous study by Kim, Liao, and Loh \autocite{kim_assessing_2024}, the authors estimated the ITEs for students in the 2015 Korea TIMSS data conditional upon a variety of individual-level and cluster-level covariates. Notably, they found evidence of a cross-level interaction between private tutoring and school resource shortages for math instruction (see Figure 1). The authors also found evidence of a second cross-level interaction between private tutoring and the school's emphasis on academic success (see Figure 2).

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.75\linewidth]{TIMSS_CATE_ResShort.pdf}
\caption{CATE estimates of the impact of private tutoring on students’ TIMSS mathematics scores with respect to school-level resource shortage.}
\label{fig_TIMSS_CATE_ResShort}
\end{figure}

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.75\linewidth]{TIMSS_CATE_AcadEmph.pdf}
\caption{CATE estimates of the impact of private tutoring on students’ TIMSS mathematics scores with respect to school-level emphasis on academic success.}
\label{fig_TIMSS_CATE_AcadEmph}
\end{figure}

Two important observations from these findings are that 1) the functional form of the heterogeneous treatment effect varies (in Figure 1 the form appears to be quadratic, while in Figure 2 it appears to be linear), and 2) the amount of heterogeneity ascribed to these interactions depends on the method used (CF: Causal Forest; BART: Bayesian Additive Regression Trees; X-RF: X-Learner with Random Forests; X-BART: X-Learner with BART). These observations motivate the current study. Specifically, we wanted to explore how variation in the functional form of treatment effects impacts the accuracy of popular nonparametric regression tree approaches in the context of clustered data with a non-random treatment assignment. In the following sections, we provide a brief overview of how CATEs can be estimated with nonparametric regression and detail the results from a preliminary simulation study.

\section{CATE Estimation via Nonparametric Regression}

Without pre-existing knowledge of subgroups and the functional form of heterogeneous treatment effects, CATEs can become difficult to estimate, especially in data with many covariates where interactions between covariates and treatment are potentially numerous and complex.

Nonparametric regression methods, specifically nonparametric regression tree-based methods, are a useful approach for estimating CATEs because they are “agnostic” to the functional form of heterogeneous treatment effects and they can consider a large number of covariates as potential characteristics upon which treatment effects may vary. In general, nonparametric regression tree-based methods start with the supposition that the observed outcomes can be defined as the output from some “unknown” function $f(\cdot)$:

\begin{equation}
\begin{aligned}\label{eq:sixth}
    Y_{ij} = f(\mathbf{X_{ij}}, \mathbf{Z_j}, T_{ij})+\epsilon_{ij},
\end{aligned}
\end{equation}

\noindent where $\mathbf{X_{ij}}$ is a vector of individual-level covariates, $\mathbf{Z_j}$ is a vector of cluster-level covariates, $T_{ij}$ is the binary treatment indicator, and $\epsilon_{ij}$ is an error term with $\text{E}[\epsilon_{ij}]=0$ and no further distributional assumptions. Nonparametric regression tree-based methods vary from each other in a variety of ways, but in general, their distinguishing characteristics can be organized into four broad domains:

\begin{enumerate}
  \item The components included in the functional definition of the outcome.
  \item CATE estimation as “counterfactual prediction” vs. “effect estimation”.
  \item The type and targets of regularization.
  \item The statistical framework (Bayesian vs. frequentist).
\end{enumerate}

The first domain refers to variations of the expression in equation 6. For instance, some methods may include cluster identification or ID as an additional input in $f(\cdot)$, resulting in

\begin{equation}
\begin{aligned}\label{eq:seventh}
    Y_{ij} = f(\mathbf{X_{ij}}, \mathbf{Z_{j}}, T_{ij}, j)+\epsilon_{ij},
\end{aligned}
\end{equation}

\noindent where $j$ now operates in a manner akin to specifying cluster ID as a fixed effect in a parametric regression model. 

The second domain refers broadly to the “learner” or approach taken to estimate CATEs \autocite{kunzel_metalearners_2019, caron_estimating_2022}. Tran and colleagues coined the terms “counterfactual prediction” and “effect estimation” as ways to refer to the most common approaches for estimating CATEs with nonparametric methods \autocite{tran_data-driven_2024}. Counterfactual prediction refers to approaches that estimate two models, $f_1(\mathbf{X_{ij}}, \mathbf{Z_{j}}, T_{ij}=1)$ and $f_0(\mathbf{X_{ij}}, \mathbf{Z_{j}}, T_{ij}=0)$, for treated and control units respectively. These fitted models can then be used to predict the unobserved potential outcome for each unit. Taking the difference between the observed and predicted potential outcomes gives us an estimate of the ITE conditioned upon the individual-level and cluster-level covariates. Examples of methods that use counterfactual prediction are BART as a T-learner \autocite{hill_bayesian_2011} and stan4bart (S4BART) \autocite{dorie_stan_2022}.
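
In symbols, one way to write the counterfactual prediction estimate described above is sketched below. The hat notation and the case split are illustrative shorthand introduced here, not notation taken from the cited papers:

\begin{equation*}
    \hat{\tau}_{ij} =
    \begin{cases}
        Y_{ij} - \hat{f}_0(\mathbf{X_{ij}}, \mathbf{Z_{j}}), & \text{if } T_{ij}=1, \\
        \hat{f}_1(\mathbf{X_{ij}}, \mathbf{Z_{j}}) - Y_{ij}, & \text{if } T_{ij}=0.
    \end{cases}
\end{equation*}

\noindent A closely related variant replaces the observed outcome with its fitted value, giving $\hat{\tau}_{ij} = \hat{f}_1(\mathbf{X_{ij}}, \mathbf{Z_{j}}) - \hat{f}_0(\mathbf{X_{ij}}, \mathbf{Z_{j}})$ for every unit regardless of treatment status.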

Effect estimation refers to approaches that directly estimate the CATEs by re-specifying the formula in equation 6 to take the general form of

\begin{equation}
\begin{aligned}\label{eq:eighth}
    Y_{ij} = f(\mathbf{X_{ij}}, \mathbf{Z_{j}}, T_{ij})+\epsilon_{ij} = \mu(\mathbf{X_{ij}}, \mathbf{Z_{j}}) + \tau(\mathbf{X_{ij}}, \mathbf{Z_{j}}) \times T_{ij} + \epsilon_{ij}, 
\end{aligned}
\end{equation}

\noindent where $\mu(\mathbf{X_{ij}}, \mathbf{Z_{j}})$ is a function that gives the prognostic outcome, and $\tau(\mathbf{X_{ij}}, \mathbf{Z_{j}})$ is a function that gives the individual's CATE. The function $\tau(\mathbf{X_{ij}}, \mathbf{Z_{j}})$ can be estimated using nonparametric regression tree-based approaches, such as BART. Examples of methods that use effect estimation are Bayesian Causal Forest (BCF) \autocite{hahn_bayesian_2020} and CF \autocite{wager_estimation_2018}. 
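
To see why $\tau(\cdot)$ in this specification can be read as the CATE, suppose additionally that $\text{E}[\epsilon_{ij} \ | \ \mathbf{X_{ij}}, \mathbf{Z_{j}}, T_{ij}]=0$. This is a stronger condition than the mean-zero error stated earlier and is assumed here only for illustration. Then

\begin{equation*}
\begin{aligned}
    &\text{E}[Y_{ij} \ | \ T_{ij}=1, \mathbf{X_{ij}}, \mathbf{Z_{j}}] - \text{E}[Y_{ij} \ | \ T_{ij}=0, \mathbf{X_{ij}}, \mathbf{Z_{j}}] \\
    &\quad = \left\{\mu(\mathbf{X_{ij}}, \mathbf{Z_{j}}) + \tau(\mathbf{X_{ij}}, \mathbf{Z_{j}})\right\} - \mu(\mathbf{X_{ij}}, \mathbf{Z_{j}})
    = \tau(\mathbf{X_{ij}}, \mathbf{Z_{j}}),
\end{aligned}
\end{equation*}

\noindent so the contrast between treated and control conditional means, which equals the CATE under unconfoundedness and SUTVA, is carried entirely by $\tau(\cdot)$, while $\mu(\cdot)$ cancels.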

The third domain refers to which parts of the estimation algorithm or function are regularized and what type of regularization procedure is used. In BART, overfitting is mitigated because the approach is an ensemble method in which many “weak learners” are combined to provide a full picture of the data. BART combines many small regression trees, and a Bayesian prior keeps these trees shallow, meaning that each tree has only a few cut (decision) points. In this situation, the targets of regularization are the regression trees, and the type of regularization is a Bayesian shrinkage prior.

The final domain, the statistical framework chosen, is self-explanatory, but it also has implications for the third domain because each framework gives access to different types of regularization procedures. For example, working in a Bayesian framework allows for the use of Bayesian shrinkage priors for regularization. In the current study, we identified methods that were both popular and varied across these four domains in meaningful ways: CF \autocite{wager_estimation_2018}, BCF \autocite{hahn_bayesian_2020}, Sparse Bayesian Causal Forest (SBCF) \autocite{caron_shrinkage_2022}, and S4BART \autocite{dorie_stan_2022}. The differences between these selected methods are summarized in Table 1 and Table 2. For further details about each of these approaches, see their respective references.


\begin{table}
    \centering
    \resizebox{0.95\textwidth}{!}{%
    \begin{NiceTabular}{|l|l|l|l|} \hline
         Method&CATE Estimation&  Regularization Types&  Regularization Targets\\ \hline 
         BCF-FE&Effect Estimation&  BART prior&  CATE, prognostic outcome, trees\\ \hline 
         SBCF-FE&Effect Estimation&  BART prior \& Dirichlet Prior&  CATE, prognostic outcome, splitting, trees\\ \hline 
         CF-FE&Effect Estimation&  Adaptive kernel weighting&  Nuisance functions, CATE\\ \hline 
         S4BART&Counterfactual Prediction& BART prior& Trees\\ \hline
    \end{NiceTabular}
    }
    \caption{Selected methods' estimation and regularization approaches}
    \label{tab:table1}
\end{table}

\begin{table}
    \centering
    \begin{NiceTabular}{|l|l|} \hline 
         Method& Outcome Specification\\ \hline 
         BCF-FE& $Y_{ij}=\mu(\mathbf{X_{ij}}, \mathbf{Z_{j}},j)+\tau(\mathbf{X_{ij}}, \mathbf{Z_{j}},j) \times T_{ij}+\epsilon_{ij}$\\ \hline 
         SBCF-FE& $Y_{ij}=\mu(\mathbf{X_{ij}}, \mathbf{Z_{j}},j)+\tau(\mathbf{X_{ij}}, \mathbf{Z_{j}},j) \times T_{ij}+\epsilon_{ij}$\\ \hline 
         CF-FE& $Y_{ij}=m(\mathbf{X_{ij}}, \mathbf{Z_{j}},j)+(T_{ij}-\pi(\mathbf{X_{ij}}, \mathbf{Z_{j}},j))\tau(\mathbf{X_{ij}}, \mathbf{Z_{j}},j) +\epsilon_{ij}$\\ \hline 
         S4BART& $Y_{ij}=f(\mathbf{X_{ij}}, \mathbf{Z_{j}},T_{ij})+U_{0j}+\epsilon_{ij}$\\ \hline
    \end{NiceTabular}
    \caption{Selected methods' outcome specifications}
    \label{tab:table2}
\end{table}

\section{Simulation: Semi-Synthetic TIMSS data}

We conducted a simulation study to answer the following questions: (1) for what functional forms of heterogeneous treatment effects will different nonparametric methods agree, and (2) for what forms of heterogeneous treatment effects will different nonparametric methods disagree? The data contexts in which we are interested are observational studies with clustered data, so we wanted to use a data generating procedure that would mimic these situations. To accomplish this, we used a semi-synthetic data generation process \autocite{hill_bayesian_2011, buhrman_exploring_2024}. Our semi-synthetic approach differs from previous approaches in that we generated covariates based on the covariance structure of real data rather than randomizing or sampling from real data. Specifically, we used covariates from the 2019 United States TIMSS data to obtain the covariance structure used in our simulation. All analyses were performed using R Statistical Software \autocite{r_software}. Implementation of BCF and SBCF was performed using the SparseBCF package \autocite{sparseBCF}, implementation of CF was performed using the grf package \autocite{grf}, and implementation of S4BART was performed using the stan4bart package \autocite{stan4bart}.

\subsection{Data and Variables}

The 2019 United States TIMSS data includes several context variables for both students and schools. Student and family-related covariates we used to generate a covariance structure for the data generation process include student gender, household socioeconomic status, student's confidence in math, student's fondness for math, student's value of math, and the number of absences a student had. School- or cluster-level covariates include the percentage of male students, the average SES of students, the school's emphasis on academic success, the school's strictness in disciplinary policies, and the degree to which mathematics instruction was affected by school resource shortage. We school-mean centered student-level covariates prior to obtaining the covariance structures for student- and school-level covariates. Using these covariance structures, we generated clustered data with random intercepts. 

We generated data for 30 clusters, with cluster sizes ranging from 22 to 38 and an average cluster size of 30. The unconditional intraclass correlation (ICC) was 0.15. Our goal was to generate data for the effect of a non-random individual-level treatment assignment on students' math performance, where this treatment effect followed different functional forms according to another covariate. We generated an individual-level non-random treatment assignment, which we framed as student participation in extracurricular math activities like Math Olympiad, based on the following propensity score function

\begin{equation}
\begin{aligned}\label{eq:ninth}
\resizebox{0.9\textwidth}{!}{$
    \pi_{ij} = \text{logit}^{-1}(-0.25 - 0.25\,\text{Absences}_{ij} + 0.25\,\text{SES}_{ij} + 0.5\,\text{Confidence}_{ij} + 0.25\,\text{Emphasis}_{j} - 0.75\,\text{Shortage}_j + W_j),
    $}
\end{aligned}
\end{equation}

\noindent where $\pi_{ij}$ is the probability that student $i$ in school $j$ participates in an extra-curricular math activity, $W_j\sim N(0, 0.01)$ and the binary treatment indicator is $T_{ij} \sim \text{Bernoulli}(\pi_{ij})$.
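
As a rough numerical illustration, read the link as the inverse logit, $\text{logit}^{-1}(x)=1/(1+e^{-x})$, so that the linear predictor is mapped to a probability. Consider a hypothetical student whose covariates all equal zero in a school with $W_j=0$, and a second hypothetical student who differs only in attending a school with a resource shortage value of 1. These values are chosen purely for illustration and are not drawn from the simulated data:

\begin{equation*}
\begin{aligned}
    \pi_{ij} &= \frac{1}{1+e^{0.25}} \approx 0.438 \quad \text{(all covariates equal to 0)}, \\
    \pi_{ij} &= \frac{1}{1+e^{0.25+0.75}} = \frac{1}{1+e^{1}} \approx 0.269 \quad \text{(resource shortage equal to 1)}.
\end{aligned}
\end{equation*}

\noindent Under these assumed values, a one-unit increase in school resource shortage lowers the probability of participation by roughly 17 percentage points.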

We also generated heterogeneous treatment effects based on three different functional forms: (1) linear, (2) quadratic, and (3) logistic. Each form can be thought of as a cross-level interaction between a school-level covariate and the treatment indicator. These can be expressed as

\begin{equation}
\begin{aligned}\label{eq:tenth}
    \tau_{ij(\text{linear})} = \frac{\text{Shortage}_j + 2}{8},
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}\label{eq:eleventh}
    \tau_{ij(\text{quadratic})} = \frac{-\text{Shortage}_j^2 + 5}{10},
\end{aligned}
\end{equation}

\begin{equation}
\begin{aligned}\label{eq:twelth}
    \tau_{ij(\text{logistic})} = \frac{0.5}{1+e^{-\text{Shortage}_j / 0.25}}.
\end{aligned}
\end{equation}
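
To make the contrast among the three forms concrete, the following sketch differentiates each function with respect to the school resource shortage covariate, written as $s$. The symbol $s$, the shorthand $\sigma(u)=1/(1+e^{-u})$, and the evaluation points $s \in \{-2, 0, 2\}$ are introduced here for illustration only and are not claimed to match the range of the simulated covariate:

\begin{equation*}
\begin{aligned}
    \frac{d\,\tau_{\text{linear}}}{ds} = \frac{1}{8}, \qquad
    \frac{d\,\tau_{\text{quadratic}}}{ds} = -\frac{s}{5}, \qquad
    \frac{d\,\tau_{\text{logistic}}}{ds} = 2\,\sigma(4s)\,\{1-\sigma(4s)\}.
\end{aligned}
\end{equation*}

\begin{equation*}
\begin{array}{lccc}
    s & -2 & 0 & 2 \\ \hline
    \tau_{\text{linear}} & 0 & 0.25 & 0.5 \\
    \tau_{\text{quadratic}} & 0.1 & 0.5 & 0.1 \\
    \tau_{\text{logistic}} & \approx 0.0002 & 0.25 & \approx 0.4998
\end{array}
\end{equation*}

\noindent The linear and logistic slopes are positive everywhere, so both forms are monotonic, whereas the quadratic slope changes sign at $s=0$. The logistic slope is not constant, however: it is close to zero away from $s=0$ and reaches a maximum of $0.5$ at $s=0$, four times the linear slope, so nearly all of the change in the treatment effect is concentrated in a narrow region.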

\noindent We generated 1000 iterations of data under each of these functional form conditions and estimated the ITEs conditional on covariates for each iteration using each of the four methods previously described. This left us with the true ITEs conditional on covariates and the estimated ITEs conditional on covariates for 1000 iterations of each functional form specification. As an aside, we estimated propensity scores the same way for every method, using BART with random intercepts.

To quantify and compare the performance of each method, we used the Precision of Estimation of Heterogeneous Effects (PEHE) \autocite{hill_bayesian_2011}. The PEHE reflects both the bias and the variance of the estimated ITEs conditioned on covariates. However, because our simulated data are observational, some regions of the covariate space may lack overlap between treated and control units. Estimating CATEs in regions of non-overlap produces biased estimates, so we use the PEHE on the treated, or PEHET, as the evaluation criterion. PEHET can be expressed as

\begin{equation}
\begin{aligned}\label{eq:thirteenth}
    \text{PEHET} =\sqrt{\frac{1}{N_T}\sum_{i=1}^{N_T}(\tau_{i}-\hat{\tau}_{i})^2},
\end{aligned}
\end{equation}

\noindent where $N_T$ is the number of treated units, $\tau_{i}$  is the true ITE conditional on covariates for individual $i$, and $\hat{\tau}_i$ is the estimated ITE conditional on covariates for individual $i$.
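
The statement that PEHET reflects both bias and variance can be sketched with a standard algebraic identity. The symbols $d_i$ and $\bar{d}$ are introduced here for illustration. Let $d_i=\hat{\tau}_i-\tau_i$ denote the estimation error for treated unit $i$ and $\bar{d}=\frac{1}{N_T}\sum_{i=1}^{N_T}d_i$ the mean error. Then

\begin{equation*}
\begin{aligned}
    \text{PEHET}^2 = \frac{1}{N_T}\sum_{i=1}^{N_T} d_i^2 = \bar{d}^{\,2} + \frac{1}{N_T}\sum_{i=1}^{N_T}(d_i-\bar{d})^2,
\end{aligned}
\end{equation*}

\noindent where the first term is the squared average error and the second is the spread of the errors around that average. As a hypothetical example with $N_T=3$, true effects $(0.25, 0.25, 0.50)$, and estimates $(0.35, 0.15, 0.50)$, the errors are $(0.10, -0.10, 0)$, so $\bar{d}=0$ and

\begin{equation*}
\begin{aligned}
    \text{PEHET} = \sqrt{\frac{0.10^2 + (-0.10)^2 + 0^2}{3}} = \sqrt{\frac{0.02}{3}} \approx 0.082.
\end{aligned}
\end{equation*}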

\subsection{Results}

We find that stan4bart recovers the true heterogeneous treatment effect more accurately than the other three methods, with the lowest mean and median PEHET under every functional form (see Table 3 and Figure 3). We also observe that every method was most accurate when the heterogeneous treatment effect took a linear form, and that every method except stan4bart, whose accuracy was nearly identical under the quadratic and logistic forms, was least accurate when the treatment effect took a logistic form (see Figure 3). The loss of accuracy was especially pronounced for Causal Forest, for which the distribution of PEHETs under the logistic form condition is clearly separated from the distributions of PEHETs under the linear and quadratic form conditions. Recalling our initial interest in whether the monotonicity of the functional form matters for heterogeneous treatment effect estimation, monotonicity appears to matter until the complexity of the functional form is considered. A comparison of only the linear and quadratic forms would suggest that non-monotonic functional forms are more difficult for nonparametric regression tree methods to estimate. However, the logistic form, which is monotonic, was harder to estimate than the quadratic form for three of the four methods, which suggests that functional complexity, not monotonicity, drives the pattern observed in the simulation results.
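
One rough way to summarize how much each method degrades from the linear to the logistic condition is the ratio of mean PEHETs. The ratios below are computed from the rounded means reported in Table 3, so they are approximate and are offered only as an illustrative summary:

\begin{equation*}
\begin{aligned}
    \text{BCF-FE: } \frac{0.205}{0.147} \approx 1.39, \qquad
    \text{SBCF-FE: } \frac{0.226}{0.165} \approx 1.37, \qquad
    \text{CF-FE: } \frac{0.238}{0.158} \approx 1.51, \qquad
    \text{S4BART: } \frac{0.153}{0.135} \approx 1.13.
\end{aligned}
\end{equation*}

\noindent By this measure, the mean PEHET of CF-FE is about 51\% higher under the logistic form than under the linear form, compared with about 13\% for S4BART.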

\begin{table}
    \centering
    \resizebox{0.95\textwidth}{!}{%
    \begin{NiceTabular}{|c|c|c|c|c|c|c|} \hline 
 & \multicolumn{2}{|c|}{Linear Form}& \multicolumn{2}{|c|}{Quadratic Form}& \multicolumn{2}{|c|}{Logistic Form}\\ \hline 
         &  Mean (SD)&  Median [Min, Max]&  Mean (SD)&  Median [Min, Max]&  Mean (SD)&  Median [Min, Max]\\ \hline 
         BCF-FE&  0.147 (0.040)&  0.140 [0.076, 0.313]&  0.184 (0.044)&  0.177 [0.082, 0.423]&  0.205 (0.040)&  0.203 [0.096, 0.364]\\ \hline 
         SBCF-FE&  0.165 (0.044)&  0.161 [0.051, 0.366]&  0.202 (0.047)&  0.198 [0.057, 0.410]&  0.226 (0.054)&  0.232 [0.063, 0.396]\\ \hline 
         CF-FE&  0.158 (0.031)&  0.152 [0.086, 0.317]&  0.175 (0.046)&  0.169 [0.084, 0.413]&  0.238 (0.043)&  0.242 [0.084, 0.389]\\ \hline 
         S4BART&  0.135 (0.040)&  0.130 [0.051, 0.313]&  0.156 (0.042)&  0.149 [0.066, 0.360]&  0.153 (0.045)&  0.150 [0.051, 0.298]\\ \hline
    \end{NiceTabular}
    }
    \caption{PEHET statistics for each method and functional form condition across 1000 iterations}
    \label{tab:table3}
\end{table}

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.95\linewidth]{figure_violins_PEHE_rev.png}
\caption{Distributions of PEHETs by method for each functional form condition}
\label{fig_simResults}
\end{figure}

\section{Discussion}

Estimation of heterogeneous treatment effects has become an increasingly important topic of research, especially in clustered data where the contexts of membership in different clusters may change the degree to which individuals benefit from an intervention. Nonparametric regression trees are a popular technique for estimating heterogeneous treatment effects because of their flexibility and usability. However, little research has paid attention to how the functional form of the treatment effect affects the performance of specific nonparametric methods and the family of methods overall. In a preliminary study, we initially hypothesized that monotonicity of the functional form may impact the accuracy and variance of ITE estimation, but found evidence that functional complexity was the driving factor for differences across all the methods we investigated. These proceedings detail these findings and outline the key differences between the methods we investigated.

The goal of this work is to find conditions in which the results of different methods are consistent and to compare these to conditions where the results from different methods are inconsistent. By finding these conditions, we can begin to identify the characteristics of methods which might be most relevant for certain conditions. We can apply these findings to practice in the form of method selection when the objective is to estimate heterogeneous treatment effects in plausible scenarios, including observational data with uneven treatment allocation, data with multiple sources of treatment heterogeneity, and clustered data with small or varying cluster sizes.

For the findings presented in these proceedings, we suspect that the differences in the accuracy and variance of ITE estimation can be explained with the four domains we outlined in Section 2. Specifically, we hypothesize that the types and targets of regularization and the way that CATE is estimated play the largest role in observed differences depending on the condition of functional form. Based on the results of this study, we recommend the use of stan4bart for general purpose estimation of heterogeneous treatment effects from multilevel data when treatment is assigned at the lowest level. However, stan4bart makes a distributional assumption to model the variance between clusters, meaning that users should still perform diagnostic checks on the parametric component of the model. Future work in this area could consider the four domains we have described and should consider functional forms that include multiple covariates. Furthermore, future research should continue to explore how nonparametric regression tree methods can be best applied in the context of multilevel data.

\section*{Acknowledgment} 
The authors thank proceedings editor Okan Bulut for his helpful comments.

\paragraph{Funding Statement}

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Award \#R305B200026 to the University of Wisconsin-Madison. The opinions expressed are those of the authors and do not represent views of the U.S. Department of Education.

\paragraph{Competing Interests}

None.

\printbibliography

\end{document}