%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{amsthm}
\usepackage{amssymb}

\usepackage{bbm}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\renewcommand\thesubfigure{\roman{subfigure}}
\newcommand{\A}[1]{{\color{blue}[\textbf{AA:}#1]}}
\newcommand{\JD}[1]{{\color{red}[\textbf{JD:}#1]}}
\usepackage{algorithm}
\usepackage{algcompatible}
\usepackage{footnote}  % To handle footnotes properly
\usepackage{wrapfig}

\newtheorem*{proposition1}{\textbf{Proposition~\ref{thm:ideal_pb}}}
\newtheorem*{proposition2}{\textbf{Proposition~\ref{thm:practical_pb}}}

\usepackage{tcolorbox}

\newcommand{\cov}{\mathrm{Cov}}
\newcommand{\ind}{\perp\!\!\!\!\perp} 

\newcommand{\CF}{\text{\tiny CF}}
\newcommand{\cF}{\mathcal{F}}
\newcommand{\F}{\text{\tiny F}}
 \newcommand{\AF}{\text{\tiny AF}}
\newcommand{\RCT}{\text{\tiny RCT}}
\newcommand{\ATE}{\text{\tiny ATE}}
\newcommand{\PEHE}{\text{\tiny PEHE}}
\newcommand{\IFF}{\text{IF}}
\newcommand{\ICF}{\text{ICF}}

\title{Conditional Average Treatment Effect Estimation Under Hidden Confounders}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Ahmed Aloui}
\author[1]{Juncheng Dong}
\author[1,2]{Ali Hasan}
\author[1]{Vahid Tarokh}
% Add affiliations after the authors
\affil[1]{%
Department of Electrical and Computer Engineering, Duke University}
\affil[2]{%
    Machine Learning Research, Morgan Stanley
}
\begin{document}
\maketitle

\begin{abstract}
One of the major challenges in estimating conditional potential outcomes and conditional average treatment effects (CATE) is the presence of hidden confounders. Since testing for hidden confounders cannot be accomplished with observational data alone, conditional unconfoundedness is commonly assumed in the literature on CATE estimation. Nevertheless, estimators that rely on this assumption can be significantly biased by unobserved confounders. In this work, we consider the case where, in addition to a potentially large observational dataset, a small dataset from a randomized controlled trial (RCT) is available.
Notably, we do not assume that any covariate information is available for the RCT dataset; we only require the outcomes to be observed. We propose a CATE estimation method based on a pseudo-confounder generator and a CATE model that aligns the learned potential outcomes from the observational data with those observed from the RCT. Our method is applicable to many practical scenarios of interest, particularly those where privacy is a concern (e.g., medical applications). Extensive numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world datasets.
\end{abstract}


\section{Introduction}
\label{sec:intro}
Estimating treatment effects is of significant interest to various scientific communities, such as in medicine~\citep{glass2013causal,feuerriegel2024causal} and social sciences~\citep{imbens2015causal,imbens2024causal} for assessing the efficacy of a policy. 
Recently, various methods have been developed using machine learning to estimate individual-level treatment effects, also known as the \emph{conditional average treatment effects} (CATE)~\citep{shalit,alaa,athey,shi2019adapting,guo2023estimating,schweisthal24a,fang2024causalstonet}.
% \textbf{Confounding Problem.} 
While these methods have proven successful, their effectiveness in estimating treatment effects can be significantly compromised in real-world applications due to the confounding problem~\citep{kallus2019interval,chor2024three}. Confounders are variables that influence both the treatment and the outcome. If not properly controlled for, they can severely bias the estimates of potential outcomes and treatment effects~\citep{rosenbaum1983}. While it is well-established that treatment effects are identifiable under the assumption of \emph{conditional unconfoundedness} (that is, no hidden confounders), \emph{estimating conditional treatment effects becomes much more challenging under {unobserved} confounders}~\citep{imbens2015causal,kallus2018confounding}. 
In some ideal scenarios like Randomized Controlled Trials (RCTs), conditional unconfoundedness might be achieved by design. However, these experiments often require an expensive data collection process. Furthermore, \emph{the conditional unconfoundedness assumption is inherently not falsifiable from observational data alone}~\citep{popper2005logic}. For instance, passively collected healthcare databases often lack essential clinical details that can influence treatment decisions made by both doctors and patients, such as subjective evaluations of the severity of a condition or personal lifestyle factors. Consequently, when applying causal inference models to observational data, it is common to assume conditional unconfoundedness, which may fail to hold in practice and cannot be tested. This can cause significant bias in potential outcome estimation.


\textbf{Problem Setting.} In this work, we propose a novel approach to mitigate the bias in estimating CATE under hidden confounders.
Our analysis begins by considering a scenario in which both observational data and RCT data are present -- a common situation in many fields, such as in healthcare, where large observational datasets with rich features (e.g., electronic health records) are readily available, but RCTs are expensive and often too small to support complex models for learning CATE.
In particular, we consider scenarios where only the outcomes from a small batch of RCTs are available alongside observational datasets, circumventing the requirements for individual covariates from RCTs. Such scenarios are plausible in real-world applications where:

\begin{itemize}
    \item \textit{Privacy restrictions:} Full access to detailed features may be unavailable due to privacy concerns. For example, covariates cannot be shared between institutions. Consider a European hospital and a U.S. hospital collaborating on cancer treatment outcomes. Due to GDPR in Europe and HIPAA in the U.S., raw patient features (e.g., genetic profiles, detailed medical history) cannot be shared. However, aggregated outcomes (e.g., survival rate, remission status) can be exchanged. In this case, the goal is to use the available outcomes from the partner site to balance and deconfound a model trained on local (observational) patient features, without ever accessing the full covariates from the RCT data.
    
    \item \textit{Mismatched covariates} between RCT and observational data: For example, an RCT studying the effect of a new diabetes intervention may have collected detailed clinical measurements (e.g., insulin sensitivity markers), while the current hospital EHR system includes only demographic and basic lab results (e.g., A1C, fasting glucose). In this case, we cannot directly use the RCT model because the features do not align, but we can still use the outcomes from the RCT to help deconfound the observational data, e.g., by regularizing the predictions.
    
    \item \textit{Old RCT outcomes} with no collected features: For instance, an old clinical trial for a hypertension drug conducted in the 1990s might have preserved only treatment assignments and blood pressure outcomes, while detailed patient covariates (e.g., age, BMI, comorbidities) were not digitized or are no longer accessible. Meanwhile, modern electronic health records (EHRs) include rich observational features but lack experimental data. In this case, we can use the outcome distribution from the old RCT to improve robustness or reduce confounding in models trained on the newer EHR data.
\end{itemize}


% In particular, addressing such scenarios enables us to overcome a key challenge: when the features of RCT data differ from those of observational data, we can still leverage RCT outcomes instead of entirely discarding valuable RCT data.
% Additionally, if the features collected during an RCT experiment differ from those in the observational data (e.g., due to outdated RCT experiments, data collected from a different country, etc.), and we aim to estimate treatment outcomes based on the observational data features, our method remains applicable. 
% Therefore, we assume that only the outcomes are accessible in the RCT data.
%, which may also consist of a limited number of samples.
%\begin{figure}[!t]  % Use !t to force the figure at the top
%\centering
%\begin{minipage}{0.74\linewidth}
%\centering
%\includegraphics[width=\textwidth]{./figures/Framework_1.png}
%\end{minipage}
%\hfill
%\begin{minipage}{0.24\linewidth}
%\centering
%\includegraphics[width=\textwidth]{./figures/Inference.png}
%\end{minipage}
%\caption{Left: Training Pipeline. Right: Inference.}
%\label{fig:vis-intro}
%\end{figure}


\textbf{Method.} Our proposed method consists of two regularization modules that use the given RCT outcomes to restrict the hypothesis search space and thereby reduce the bias due to hidden confounders. The proposed regularization modules are CATE model-agnostic, that is, they can be added to any neural-network-based CATE estimation model.

Marginals Balancing (MB): The first regularization builds on the key fact that the RCT outcomes can be considered as samples from the true potential outcomes. Motivated by this, we use a pseudo-confounder generator to emulate the hidden confounders, and require the potential outcomes predicted by the CATE model from the features and the pseudo-confounders to be equal in distribution to the observed outcomes from the RCT data.

Projections Balancing (PB): The second approach is based on the observation that the projection of the learned potential outcomes onto any transformation of the features should correspond to that of the true potential outcomes on the same transformation.
 
\begin{figure}
\centering
\includegraphics[width=0.35\textwidth]{./figures/cate_intro.png}
\caption{Comparison of CATE estimates using the factual learner, the MB and PB models, and MB+PB.}
\label{Fig:cate_linear}
\end{figure}
Our final model (MB+PB) combines both approaches, as we numerically observe that doing so restricts the search space for the factual optimization problem and achieves the best performance. We illustrate the performance of these different models on a simple Gaussian linear model in Figure~\ref{Fig:cate_linear}. See Section~\ref{example:factual_learner_bias} for a full description of this example. 
Figure~\ref{fig:vis-intro} provides a high-level illustration of the proposed approach.
\begin{figure*}[ht]
\centering
\begin{subfigure}{0.74\textwidth}
    \includegraphics[width=\textwidth]{figures/Framework_1.png}
        \caption{Training procedure.}
        \label{fig:fig1train}
\end{subfigure}
\hfill
\begin{subfigure}{0.24\textwidth}
    \includegraphics[width=\textwidth]{figures/Inference.png}
    \caption{Inference procedure.}
            \label{fig:fig1infer}
\end{subfigure}
\caption{Schematic of the proposed training and inference procedures. \textbf{(i)}: (a) generates pseudo-confounders that are used within the CATE estimator using the observational data. Potential outcomes are then matched to the unconfounded RCT dataset in (c). \textbf{(ii)}: inference is performed by (a) sampling from the pseudo-confounder generator and (b) using the CATE model with the individual's features.}
\label{fig:vis-intro}

\end{figure*}
 

\paragraph{Related Work.}
Several recent works address the challenge of estimating treatment effects under unobserved confounding by combining randomized controlled trials (RCTs) with observational data. Some approaches combine the internal validity of RCTs with the representativeness of observational data, using techniques such as weighting and doubly robust estimators \citep{colnet2024causal}. Other methods propose a linear correction term to adjust for confounding bias \citep{kallus2018removing}. Methods have also been developed for estimating heterogeneous treatment effects, requiring covariate-level data for improved accuracy and balancing the representation of different observed features~\citep{hatt2022combining}. \citet{kallus2019interval} introduce interval estimation for CATE under unobserved confounders based on the marginal sensitivity model \citep{rosenbaum2002}. All of these methods assume that both individual covariates and outcomes from the RCTs are accessible, whereas we assume that only the outcomes of the RCT are observed. Other methods have explored specific scenarios for estimating CATE from multiple datasets, such as in recommendation systems \citep{li2024removing} or sequential observational data \citep{hatt2024sequential}. Moreover, recent works have addressed the confounding introduced by applying representation learning approaches to CATE estimation~\citep{melnychuk2024bounds}. Finally, our work is closely related to sensitivity analysis under hidden confounders. While previous works~\citep{oprescu2023b,veitch2020sense} study the error in estimating CATE under hidden confounding, we investigate potential improvements in CATE estimation when RCT outcomes are available.




% % Intro Figure
% \begin{minipage}{0.74\linewidth}
% \centering
% \includegraphics[width=\textwidth]{./figures/Framework_1.png}
% \end{minipage}
% \begin{minipage}{0.24\linewidth}
% \centering
% \includegraphics[width=\textwidth]{./figures/Inference.png}
% \end{minipage}
% \begin{figure}[!ht]
% \caption{Left: Training Pipeline. Right: Inference.}
% \label{fig:vis-intro}
% \end{figure}

\section{Problem Setup}
\label{sec:background}



Let $\left(\Omega, \mathcal{F}, \mathbb{P}\right)$ be a probability space. Consider random variables $\left(X, U, T, Y_1, Y_0\right)$ defined on $\left(\Omega, \mathcal{F}, \mathbb{P}\right)$, where $T$ is a binary random variable denoting treatment assignment, $X \in \mathcal{X} \subset \mathbb{R}^d$ represents the observed features and $U \in \mathcal{U} \subset \mathbb{R}^m$ represents unobserved confounders. The potential outcomes $Y_1, Y_0 \in \mathbb{R}$ correspond to the outcomes under treatment and control, respectively. Let $Y$ represent the observed outcome defined as \citep{hernan2020causal}\footnotemark:
\begin{equation*}
Y = T Y_1 + (1 - T) Y_0 .
\end{equation*}

Figure \ref{Fig:conf_graph} illustrates the causal graph of these variables.
% We assume that only samples from $\left(X, T, Y\right)$ are observed, meaning we do not have access to $U, Y_1,$ or $Y_0$. 

\begin{figure}
\centering
\includegraphics[width=0.2\textwidth]{./figures/causal_graph.png}
\caption{Causal graph with unobserved confounders ($U$).}
\label{Fig:conf_graph}
\end{figure}
\footnotetext{Some references take an alternative approach by first defining the factual outcome and then using the consistency assumption to define the potential outcomes.}

\paragraph{Observational Data.} In real scenarios, we do not have access to $U$, $Y_1$, or $Y_0$, which gives rise to one of the most fundamental challenges in causal inference. Instead, we only have access to samples of the random triplet $\left(X, T, Y\right)$. Thus, we assume access to an observational dataset $D_{o} = \{\left(x_{i}, t_{i}, y_{i}\right)\}_{i=1}^{n_o}$, consisting of $n_o$ independent observations. 

\paragraph{CATE Estimation.} The objective is to estimate the conditional potential outcomes $\mathbb{E}\left[Y_t \mid X\right]$ for $t \in \{0, 1\}$ and the CATE $\tau(X)$, defined as
\[
\tau(X) = \mathbb{E}\left[Y_1\mid X\right]-\mathbb{E}\left[Y_0\mid X\right].
\]
To this end, we make the standard assumption of \emph{positivity}, that is, $0 < \mathbb{P}(T = 1 \mid X) < 1$ almost surely, so that both treatment arms are observed for almost every value of the features. We also assume that $X\ind U$, as encoded in the causal graph in Figure~\ref{Fig:conf_graph}. Moreover, to identify CATE, it is common to assume \emph{conditional unconfoundedness}, that is, $Y_t \ind T \mid X$. CATE is identifiable under this assumption, but it is generally not identifiable without it~\citep{rosenbaum1983,imbens2015causal}. Hidden confounders, which are common in practice, violate conditional unconfoundedness. We therefore focus on scenarios where the assumption fails: for $t \in \{0,1\}$, we assume $Y_t \not\ind T \mid X$, i.e., the treatment assignment is not independent of the potential outcomes given the observed features due to the presence of unobserved confounders $U$.

% Let $D_{\text{O}} = \{\left(x_{i}, t_{i}, y_{i}\right)\}_{i=1}^{n_o}$ represent an observational dataset consisting of $n_o$ independent observations sampled from $P$.

\paragraph{Performance Metric.} Let $\hat{\tau}(x) = h(x,1)-h(x,0)$ denote an estimator for CATE where $h$ is a hypothesis $h: \mathcal{X}\times\{0,1\}\rightarrow \mathcal{Y}$ that estimates the conditional potential outcomes $\mathbb{E}\left[Y_t|X=x\right]$.
% , and $\hat{\tau}_{ATE}$ represent an estimator for ATE, which can be computed as the empirical mean of $\hat{\tau}(x)$. 
\begin{definition}[PEHE]
\label{def:epehe}
\textit{
The Expected Precision in Estimation of Heterogeneous Effect (PEHE)~\citep{hill2011} is defined as:
\begin{equation}
\begin{aligned}
    \varepsilon_{\PEHE}(h)&=\int_{\mathcal{X}}(\hat{\tau}(x)-\tau(x))^2 p(x) dx
    %&= \int_{\mathcal{X}}(h(x,1)-h(x,0)-\tau(x))^2 p_\F(x) d x .
\end{aligned}
\end{equation}
where $p(x)$ is the marginal density of the covariates $X$.
}
\end{definition}

The $\varepsilon_{\PEHE}$ is widely used as the performance metric for CATE estimation, especially in scenarios where heterogeneous effects are present across different individuals. 
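
In benchmarks where the true CATE is known (e.g., synthetic data), the integral in Definition~\ref{def:epehe} is usually approximated by a sample average. As a sketch, assuming $m$ held-out covariates $x_1, \dots, x_m$ drawn i.i.d.\ from $p$ (the sample size $m$ and the notation $\hat{\varepsilon}_{\PEHE}$ are introduced here for illustration only):
\[
\hat{\varepsilon}_{\PEHE}(h) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{\tau}(x_i) - \tau(x_i)\right)^2 .
\]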

% Additionally, \emph{we assume access to randomized controlled trial (RCT) outcomes data}, denoted as $\left(Y_{\text{R}}, T_{\text{R}}\right)$, the RCT outcomes can be thought of to be consisting of independent and identically distributed (i.i.d.) samples of the potential outcomes $Y'_1$ and $Y'_0$, corresponding to $Y_1$ and $Y_0$, respectively. We denote the RCT data as $D_{\text{R}} = \left\{ \left( \tilde{y}_j^1, \tilde{y}_k^0 \right) \right\}_{j=1, k=1}^{n_R^1, n_R^0}
% $ which are samples from $\left(Y'_1, Y'_0\right)$, where $n_{R}^1$ and $n_{R}^0$ represent the number of samples in the RCT treatment and control groups, respectively. The central question we aim to address is whether having access to the marginal distributions' samples from the RCT data can help control the estimation error of the conditional potential outcomes and CATE. Additionally, we explore how the confounding degree—i.e., the influence of the unobserved confounder on the treatment assignment—affects the estimation performance. This is formalized in the following paragraph.

\paragraph{RCT Data.} Given that the bias due to hidden confounders cannot be tested with observational data alone, we assume access to a small batch of RCT data. In particular, we assume access to only the outcomes of the RCT data, instead of the stronger requirement of also observing covariates. Let the outcome-only RCT data be denoted as $\left(T_{r}, Y_{r}\right)$ and let $u=\mathbb{P}(T_r=1)$. The data generating process of the RCT data is equivalent to the following process: Consider two random variables $Y'_1$ and $Y'_0$ which are equal in distribution to the true potential outcomes $Y_1$ and $Y_0$\footnote{This assumption simplifies the theoretical analysis, while our empirical results cover cases with distribution shifts.}, respectively. Then, with probability $u$, we observe one sample of $Y'_1$; with probability $1-u$, we observe one sample of $Y'_0$. 

We denote the RCT dataset as $D_r= \{D^0_r,D_r^1\}$ where $D^t_r=\{y^t_j\}^{n_r^{t}}_{j=1}$ for $t \in \{0,1\}$. In particular, $D^1_r$ and $D_r^0$ contain $n_r^1$ and $n_r^0$ samples from $Y'_1$ and $Y'_0$, respectively. 

% $D_{\text{R}} = \left\{ \left( \tilde{y}_j^1, \tilde{y}_k^0 \right) \right\}_{j=1, k=1}^{n_R^1, n_R^0}
% $ which are samples from $\left(Y'_1, Y'_0\right)$, where $n_{R}^1$ and $n_{R}^0$ represent the number of samples in the RCT treatment and control groups, respectively.
\begin{tcolorbox}[colback=lightgray!20]
The central question we explore in this work is how to apply knowledge about the marginal distributions of the true potential outcomes to help reduce the estimation error of the conditional potential outcomes and CATE under hidden confounders.
\end{tcolorbox}
We note that, to simplify the mathematical analysis, we assume that the potential outcomes in the RCT and observational data are sampled from the same distribution. However, we relax this assumption in our empirical evaluation. Additionally, standard transfer learning and domain adaptation bounds can be derived for scenarios with distribution shifts.

\paragraph{Confounding Degree.} Additionally, we explore how the \emph{confounding degree}, that is, the influence of the unobserved confounder on the treatment assignment, affects the estimation performance.  
To quantify the degree of unobserved confounding, we employ the commonly used Marginal Sensitivity Model (MSM)~\citep{rosenbaum2002}. The MSM represents a general class of functions that satisfy the \emph{\(\Gamma\)-selection bias condition} defined as follows. 
% Let $U$ denote the unobserved confounder with $U \in \mathcal{U} \subset \mathbb{R}^m$, a $\Sigma-$measurable random variable.

\begin{definition}[\(\Gamma\)-selection bias condition]
\label{def:msm}
\textit{
A probability measure $\mathbb{P}$ satisfies the \(\Gamma\)-selection bias condition with \(1 \leq \Gamma < \infty\) if, for \(\mathbb{P}\)-almost all \(u, \tilde{u} \in \mathcal{U}\) and \(x \in \mathcal{X}\), the treatment odds
\[
\pi(x,u) = \frac{\mathbb{P}(T = 1 \mid X = x, U = u)}{\mathbb{P}(T = 0 \mid X = x, U = u)}
\]
satisfy
\begin{equation}
\frac{1}{\Gamma} \leq \frac{\pi(x,u)}{\pi(x,\tilde{u})} \leq \Gamma.
\end{equation}
}
\end{definition}

The confounding degree is defined as the \emph{minimum value} of $\Gamma$ that satisfies the \(\Gamma\)-selection bias condition. Intuitively, the condition holds when the odds of receiving the treatment change by at most a factor of $\Gamma$ as the unobserved confounder $U$ varies while the observed features remain fixed. Note that $\Gamma = 1$ corresponds to the case where $U$ has no effect on the likelihood of treatment assignment given the observed features. 
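
As an illustrative example (the logistic form, the coefficients $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$, and the bound $c > 0$ are assumptions made here and are not part of the definition), suppose that $U$ is scalar with support $[-c, c]$ and that $\mathbb{P}(T = 1 \mid X = x, U = u) = 1/\left(1 + \exp(-a^\top x - b u)\right)$. The treatment odds are then $\pi(x,u) = \exp(a^\top x + b u)$, so that
\[
\frac{\pi(x,u)}{\pi(x,\tilde{u})} = \exp\left(b\,(u - \tilde{u})\right) \in \left[e^{-2|b|c},\; e^{2|b|c}\right],
\]
and the confounding degree is $\Gamma = e^{2|b|c}$. It grows with the strength $|b|$ of the confounder's effect on treatment assignment and reduces to $\Gamma = 1$ when $b = 0$. If $U$ had unbounded support (e.g., a Gaussian) and $b \neq 0$, no finite $\Gamma$ would satisfy the condition under this model.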
\section{Proposed Approach}
\label{sec:approach}
In this section, we present two models designed to address the challenge of estimating conditional potential outcomes and the CATE in the presence of hidden confounders. 
% First, we outline the issue that arises when the assumption of conditional unconfoundedness is violated, resulting in biased CATE estimates. The baseline model, which relies solely on the observational data, is referred to as the factual learner}}. 
To illustrate the challenge posed by hidden confounders, Section~\ref{sec:factual_learner} first uses a case study to show the bias that arises in the baseline factual learner, which relies solely on the observational data.
Next, we introduce our two approaches: Marginals Balancing (MB) in Section~\ref{sec:mb} and Projections Balancing (PB) in Section~\ref{sec:pb}.
% , which are specifically developed to mitigate this bias
Both approaches are designed to mitigate bias, though they are based on distinct principles.
Finally, in Section~\ref{sec:algo}, we describe our combined model, MB+PB, which integrates both approaches to improve CATE estimation under hidden confounding. 

\subsection{Factual Learner}
\label{sec:factual_learner}
In the context of conditional potential outcome estimation with observational data, it is standard to solve the following optimization problem based on the observed outcome:
\begin{equation}\label{eqn:std-opt}
\min_{Z_1, Z_0  \; \sigma(X)\text{-measurable}} \mathbb{E}\left[\left(Z_T - Y\right)^2\right],
\end{equation}
where $\sigma(X)$ denotes the \emph{$\sigma$-algebra} generated by $X$, $Z_1$ and $Z_0$ are $\sigma(X)$-measurable random variables, $T\in\{0,1\}$, and $Z_T = Z_1 \mathbbm{1}_{T=1} + Z_0 \mathbbm{1}_{T=0}$. It is well established~\citep[Theorem 4.1.15]{durrett2019probability} that the unique optimal solution (up to a set of measure zero) is
\[
\forall t \in \{0,1\}, \quad Z_t = Z^F_t \triangleq \mathbb{E}\left[Y \mid X,T=t \right],
\]
which we will refer to as the factual learner. 
% This result is well-established in learning theory and optimization. 
On the other hand, the goal in causal inference is to learn the conditional potential outcomes $\mathbb{E}\left[Y_t|X\right]$ for $t \in \{0,1\}$, from which CATE can be computed. Note that under conditional unconfoundedness, we have $Z^F_t=
 \mathbb{E}\left[Y_t|X\right]$. 
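
For completeness, this identity can be checked in two steps, using only the definition $Y = T Y_1 + (1 - T) Y_0$ and conditional unconfoundedness:
\begin{align*}
Z^F_t = \mathbb{E}\left[Y \mid X, T=t\right] &= \mathbb{E}\left[Y_t \mid X, T=t\right] && (\text{since } Y = Y_t \text{ on } \{T = t\}) \\
&= \mathbb{E}\left[Y_t \mid X\right] && (\text{since } Y_t \ind T \mid X).
\end{align*}
The second step is the only place where conditional unconfoundedness is used.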

However, when conditional unconfoundedness is violated, the solution $Z^F_t$ to the standard optimization problem in Equation~(\ref{eqn:std-opt}) does not necessarily equal $\mathbb{E}\left[Y_t|X\right]$. In other words, the equality $\mathbb{E}\left[Y_t|X\right] = \mathbb{E}\left[Y|X, T=t\right]$ does not necessarily hold. In such cases, a model fit to the observed data alone does not provide an accurate estimate of the true treatment effect, because of the influence of hidden confounders.

\paragraph{Case Study.} \label{example:factual_learner_bias} To empirically illustrate the bias of the factual learner, consider the following example. Let the covariate \( X \) and the hidden confounder \( U \) be independent and normally distributed, with
\[
X \sim \mathcal{N}(1.0, 0.04)\; \text{and}\; U \sim \mathcal{N}(0, 1).
\]
The treatment assignment \( T \) is determined by a logistic model that depends on both \( X \) and the unobserved confounder \( U \):
\[
P(T = 1 | X, U) = \frac{1}{1 + \exp(-0.5 X - 2 U)}.
\]
The potential outcomes are modeled as linear functions of \( X \) and \( U \):
\[
Y_1 = -3.5 X + 3 U, \quad Y_0 = 4.5 X -0.6 U.
\]
The observed outcome \( Y \), given by $Y = T Y_1 + (1 - T) Y_0$, depends on the treatment assignment \( T \). 
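
The source of the bias in this example can be traced with a short calculation that uses only the model above together with $X \ind U$ and $\mathbb{E}[U] = 0$. The true conditional potential outcomes are
\[
\mathbb{E}\left[Y_1 \mid X\right] = -3.5 X + 3\,\mathbb{E}[U] = -3.5 X, \qquad \mathbb{E}\left[Y_0 \mid X\right] = 4.5 X - 0.6\,\mathbb{E}[U] = 4.5 X,
\]
so that $\tau(X) = -8 X$. A regression on the observed outcomes instead targets
\[
\mathbb{E}\left[Y \mid X, T=1\right] = -3.5 X + 3\,\mathbb{E}\left[U \mid X, T=1\right], \qquad \mathbb{E}\left[Y \mid X, T=0\right] = 4.5 X - 0.6\,\mathbb{E}\left[U \mid X, T=0\right].
\]
Because the treatment probability is increasing in $U$, treated units have $\mathbb{E}\left[U \mid X, T=1\right] > 0$ and control units have $\mathbb{E}\left[U \mid X, T=0\right] < 0$. Both regressions are therefore shifted upward relative to the true conditional potential outcomes.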

% In Figure~\ref{Fig:linear_bias}, we plot the results of fitting a linear regression model separately on the treatment group (\( T = 1 \)) and the control group (\( T = 0 \)), allowing us to estimate \( \mathbb{E}[Y | X, T=0] \) and \( \mathbb{E}[Y | X, T=1] \). These estimates are then compared to the true potential outcomes \( \mathbb{E}[Y_t | X] \), which represent the true causal effects. 
We draw $1000$ samples from $(X,T,Y)$, which is more than sufficient for such a simple problem in a low-dimensional setting, and fit two linear regression models separately on the treatment ($T=1$) and control ($T=0$) groups, allowing us to estimate the factual learners \( \mathbb{E}[Y | X, T=0] \) and \( \mathbb{E}[Y | X, T=1] \).
In Figure~\ref{Fig:linear_bias}, we compare the factual learner with the true potential outcomes \( \mathbb{E}[Y_t | X] \).
This comparison reveals the bias inherent in the factual learner due to the unobserved confounder \( U \).
% , highlighting the discrepancy between the observed outcomes and the true potential outcomes.
In the following sections, we propose two different approaches to alleviate the confounding effect when access to the outcomes of an RCT dataset is available.
\begin{figure}
\centering
\includegraphics[width=0.33\textwidth]{./figures/linear_biased_outcomes.png}
\caption{Comparison between the baseline factual learner and the true conditional potential outcomes for a linear Gaussian model. }
\label{Fig:linear_bias}
\end{figure}


\subsection{Marginals Balancing}\label{sec:mb}
\paragraph{Motivation.} To motivate our first model, we begin by observing that the true conditional potential outcomes, 
\(\mathbb{E}[Y_1 | X]\) and \(\mathbb{E}[Y_0 | X]\), should ideally correspond to the projection of a random variable sharing the same distribution as the true potential outcomes \(Y_1\) and \(Y_0\). Specifically, since the true potential outcome \(Y_t\) depends on both the covariates \(X\) and the hidden confounders \(U\), we propose models of the form:
\[
\tilde{Y_t} = f_t(X, \tilde{U}),
\]
where \(f_t: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}\), and \(\tilde{U} \in \mathbb{R}\) is a random variable representing the \emph{pseudo-confounder}. As motivated in Section~\ref{sec:background}, given the knowledge of the marginal distribution of $Y_t$ (from the RCT outcomes), it is natural to impose the following constraint: 
\begin{equation}\label{eqn:eq-con}
    \tilde{Y_t} \stackrel{d}{=} Y_t,
\end{equation}
where $\stackrel{d}{=}$ denotes equality in distribution. Thus, the model $\tilde{Y_t}$ should interpolate the observational data under the constraint in Equation~(\ref{eqn:eq-con}).

\paragraph{Method.} Our first approach, which we refer to as \emph{Marginals Balancing} (MB), follows this observation and can be formalized through the following optimization problem:
\begin{definition}[Optimization Problem of MB] 
\textit{
Let \(\mathcal{B}(\mathbb{R})\) denote the set of real-valued continuous and bounded functions. MB solves the following optimization problem:
    \begin{equation}
    \min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} \mathbb{E}\left[\left(Z_T - Y\right)^2\right],
\end{equation}
where, for \(t \in \{0, 1\}\),  
$Z_t = \mathbb{E}\left[f_t(X, \Tilde{U}) | X\right]$ for some function \(f_t: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}\) and a random variable $\Tilde{U} \in \mathbb{R}$ that conform to the following constraint:
\begin{equation}
\label{eq:mb_constraint}
\forall t \in \{0, 1\}, \forall \tilde{g} \in \mathcal{B}(\mathbb{R}), \quad \mathbb{E}\left[\tilde{g}(f_t(X,\Tilde{U}))\right] = \mathbb{E}\left[\tilde{g}(Y_t)\right].
\end{equation}
}
\end{definition}
Note that the constraint in Equation~(\ref{eq:mb_constraint}) implies the constraint in Equation~(\ref{eqn:eq-con}) due to the Portmanteau Lemma~\citep{billingsley1995probability}. Importantly, $\mathbb{E}\left[\tilde{g}(Y_t)\right]$ can be estimated from the outcomes in the RCT data, because they can be regarded as samples of a random variable $Y'_t$ that is equal in distribution to $Y_t$.

\paragraph{Implementation.} To solve the optimization problem of MB, we generate the pseudo-confounder $\Tilde{U}$ using a neural network $\psi$, and fit a CATE estimation model $\mu_t(X,\Tilde{U})$, with the observed covariates along with the generated pseudo-confounder as inputs, to predict the observed outcomes in the observational dataset $D_o$. Moreover, we enforce that the predicted potential outcomes match the true potential outcomes in distribution. We achieve this by adversarial training, where we instantiate \(\mathcal{B}(\mathbb{R})\) with a neural network and update its parameters to maximize the $L_2$ distance between the right-hand side and the left-hand side of the equality in Equation~(\ref{eq:mb_constraint}), estimated through the RCT data $D_r$.
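
As a rough sketch of this adversarial step (the notation below is introduced only for illustration and is an assumption, not a description of the released implementation), write $\tilde{g}_\phi$ for the critic network with parameters $\phi$, $\{x_i\}_{i=1}^{n}$ for the covariates in $D_o$, $\eta_i$ for the noise fed to $\psi$, and $\{y'_{t,j}\}_{j=1}^{m_t}$ for the outcomes of arm $t$ in $D_r$. Assuming the noise is drawn independently of the data, both sides of Equation~(\ref{eq:mb_constraint}) can be estimated by sample means, which gives the empirical discrepancy
\begin{equation*}
\widehat{\Delta}_t(\phi) = \Bigg( \frac{1}{n}\sum_{i=1}^{n} \tilde{g}_\phi\big(\mu_t(x_i, \psi(\eta_i))\big) - \frac{1}{m_t}\sum_{j=1}^{m_t} \tilde{g}_\phi\big(y'_{t,j}\big) \Bigg)^2 .
\end{equation*}
Under this sketch, the critic ascends $\sum_{t} \widehat{\Delta}_t(\phi)$, while $\psi$ and $\mu_t$ descend the factual loss plus a weighted multiple of the same quantity. Note that the first average runs over all observational units, not only those that received treatment $t$, since the constraint concerns the marginal distribution of $f_t(X, \Tilde{U})$.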
% Finally, we train the entire model by minimizing the factual loss while maximizing the difference between the left-hand side and the right-hand side of Equation~\ref{eq:mb_constraint}.


\paragraph{Empirical Illustration.} Figure~\ref{Fig:deconfounding_linear_a} illustrates the performance of the MB model on the case study in Section~\ref{example:factual_learner_bias}. We observe that the gap between the true conditional potential outcomes and the predicted potential outcomes is indeed reduced compared to the factual learner.

\paragraph{Limitation.} One notable limitation of the Marginals Balancing method is that the optimal solution to the MB optimization problem is not unique. Moreover, for certain classes of functions, it is possible to construct an optimal solution under the imposed constraint that does not recover the true conditional potential outcomes, as demonstrated by the example provided in Appendix~\ref{proofs}.
\begin{figure}
\centering
\includegraphics[width=.33\textwidth]{UAI2025/figures/linear_model1.png}
\caption{Comparison of the factual learner and MB model with the true conditional potential outcomes.}
\label{Fig:deconfounding_linear_a}
\end{figure}
 


\subsection{Projections Balancing}\label{sec:pb}

% \begin{figure}[t]
%     \centering
%     \begin{subfigure}{0.32\textwidth}
%         \includegraphics[width=\textwidth]{UAI2025/figures/linear_model1.png}
%         \caption{Model 1 (MB)}
%         \label{Fig:deconfounding_linear_a}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}{0.32\textwidth}
%         \includegraphics[width=\textwidth]{UAI2025/figures/linear_model2.png}
%         \caption{Model 2 (PB)}
%         \label{Fig:deconfounding_linear_b}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}{0.32\textwidth}
%         \includegraphics[width=\textwidth]{UAI2025/figures/linear_model3.png}
%         \caption{Model 3 (MB + PB)}
%         \label{Fig:deconfounding_linear_c}
%     \end{subfigure}
    
%     \caption{Estimating Conditional Potential Outcomes under hidden confounding.}
%     \label{Fig:deconfounding_linear}
% \end{figure}

We now introduce our second approach, called \emph{Projections Balancing} (PB). 

To illustrate the benefits of this method, we begin by considering an idealized scenario with direct access to the true potential outcomes $Y_1$ and $Y_0$, rather than relying on RCT data containing samples of $Y'_1$ and $Y'_0$, which are random variables equal in distribution to $Y_1$ and $Y_0$. In practice, this is unattainable since the treatment assignment biases the distribution of the observed outcomes in observational data. We will later relax this idealized learner to the setting where only a small set of RCT outcomes is available. 

We begin with the following result, which presents a constrained optimization problem whose \emph{unique optimal solution is precisely the conditional potential outcome $\mathbb{E}[Y_t|X]$}, the quantity we aim to identify in causal inference.
% The following proposition presents the ideal  optimization problem under the assumption that the true potential outcomes are known.
\begin{proposition}[Ideal PB]
\label{thm:ideal_pb}
Let $\mathcal{G} = \big\{g:\mathbb{R}\to [-1,1], \text{such that $g$ is piece-wise continuous}\big\}$ and consider the following optimization problem:
$$
\min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} \mathbb{E}\left[\left(Z_T - Y\right)^2\right]
$$
subject to the constraint
$$
\forall g \in \mathcal{G}, \forall t \in \{0, 1\}, \quad \mathbb{E}\left[Z_t g(X)\right] = \mathbb{E}\left[Y_t g(X)\right].
$$
The unique solution for this problem is:
$$
\forall t \in \{0, 1\}, \quad Z_t = \mathbb{E}\left[Y_t \mid X\right].
$$
\end{proposition}
\begin{proof}[Proof of Proposition~\ref{thm:ideal_pb}]
See Appendix~\ref{proofs}.
\end{proof}
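
The following informal sketch illustrates why the constraint alone already pins down $Z_t$; it assumes that $Y_t$ and $Z_t$ are integrable and that $X$ is real-valued, and it is not meant to replace the proof in Appendix~\ref{proofs}. By the tower property, $\mathbb{E}\left[Y_t g(X)\right] = \mathbb{E}\left[\mathbb{E}[Y_t \mid X]\, g(X)\right]$ for every bounded $g$, so the constraint is equivalent to
\[
\forall g \in \mathcal{G}, \quad \mathbb{E}\left[\left(Z_t - \mathbb{E}[Y_t \mid X]\right) g(X)\right] = 0.
\]
Choosing $g = \mathbbm{1}_{(a,b]}$, which is piece-wise continuous with values in $[-1,1]$, shows that the $\sigma(X)$-measurable variable $Z_t - \mathbb{E}[Y_t \mid X]$ integrates to zero over every event $\{X \in (a,b]\}$. Since such intervals generate the Borel $\sigma$-algebra, this difference vanishes almost surely, so the feasible set contains a single point and the minimizer is necessarily $Z_t = \mathbb{E}[Y_t \mid X]$.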

\paragraph{Method.} The most notable advantage of the ideal PB learner is that \emph{it provides a unique solution corresponding to the true conditional potential outcomes}. Since the true potential outcomes are not accessible in practice, we now introduce a practical PB learner by relaxing the ideal PB learner to scenarios where only RCT outcomes are available. 
% Specifically, the Practical Projections Balancing approach modifies the constraint in proposition \ref{thm:ideal_pb}, and the optimization objective becomes:
% \begin{equation}
%     \min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} \left( \mathbb{E}\left[\left(Z_T - Y\right)^2\right] + \alpha \sum_{t=0}^{1} \sup_{g \in \mathcal{G}} \left| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right] \right| \right)
% \end{equation}
% where $\alpha \in \mathbb{R}^+$ is a regularization parameter. 
\begin{definition}[Optimization Problem of PB] 
\textit{ Let $\mathcal{C} \in \mathbb{R}^+$ be a positive constant and $\mathcal{G} = \big\{g:\mathbb{R}\to [-1,1]\big\}$. PB solves the following optimization problem:
\begin{equation}\label{eqn:pb-learner}
\begin{aligned}
   &\min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} \mathbb{E}\left[\left(Z_T - Y\right)^2\right]; \\
   &\mathrm{s.t.} \max_{t \in \{0,1\}} \sup_{g \in \mathcal{G}} \big| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right] \big| \le \mathcal{C},
\end{aligned}
\end{equation}
where $Y_t'$ is a random variable equal in distribution to the true potential outcome $Y_t$.
}
\end{definition}
In this formulation, the true potential outcomes \( Y_t \) are replaced by the RCT potential outcomes \( Y'_t \). Since this constrained problem is challenging to optimize directly, in practice we appeal to duality and instead optimize the following penalized problem: 
\begin{equation}
\begin{aligned}
    & \min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} 
    \left( \mathbb{E}\left[\left(Z_T - Y\right)^2\right] \right . \\
    & \left .+ \alpha \sum_{t=0}^{1} \sup_{g \in \mathcal{G}} \big| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right] \big| \right)
\end{aligned}
\end{equation}
where $\alpha \in \mathbb{R}^+$ is a regularization parameter. We now provide a theoretical guarantee for the PB learner in Equation~(\ref{eqn:pb-learner}), which characterizes the deviation of the predicted conditional potential outcomes from the true conditional potential outcomes.

\begin{proposition}
[Practical Projections Balancing (PB)]
\label{thm:practical_pb}
Let $t\in \{0,1\}$ and define
$$
L_p(Z_t) = \sup_{g \in \mathcal{G}} \;\; | \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_tg(X)\right]|
$$
with $Y'_t \overset{d}{=} Y_t $ and $Y'_t \ind Y_t$. 
We have that,
\begin{equation}\label{eqn:error-ub}
    \mathbb{E}\left[|Z_t - \mathbb{E}[Y_t|X]|\right] \leq L_p(Z_t) + \sqrt{\mathrm{Var}(Y_t)},
\end{equation}
where $\sqrt{\mathrm{Var}(Y_t)}$ represents the standard deviation of the potential outcomes. 
\end{proposition}
\begin{proof}[Proof of Proposition~\ref{thm:practical_pb}]
See Appendix~\ref{proofs}.
\end{proof}

\begin{remark}
In particular, Equation~(\ref{eqn:error-ub}) provides an upper bound on the potential outcome estimation error of any estimator $Z_t$. It implies that, up to the standard deviation term, an estimator with a low value of $L_p(Z_t)$ is close to the true conditional potential outcomes. To this end, note that $L_p(Z_t)$ measures how well the estimator $Z_t$ conforms to the PB constraint in Equation~(\ref{eqn:pb-learner}). Thus, a solution to the PB optimization has guaranteed performance. Given that CATE under hidden confounders is not identifiable under general conditions, we conjecture that the standard deviation term in the error bound may not be further reduced due to the \emph{inherent stochasticity of $Y_t$} and the \emph{confounding effects of hidden confounders}.
\end{remark}
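
To make this guarantee concrete, note that any pair $(Z_0, Z_1)$ that is feasible for Equation~(\ref{eqn:pb-learner}) satisfies $L_p(Z_t) \le \mathcal{C}$ for both arms, so Equation~(\ref{eqn:error-ub}) yields
\[
\mathbb{E}\left[\left|Z_t - \mathbb{E}[Y_t \mid X]\right|\right] \le \mathcal{C} + \sqrt{\mathrm{Var}(Y_t)}, \qquad t \in \{0, 1\}.
\]
As a purely illustrative instance (these values are hypothetical and not taken from the experiments), $\mathcal{C} = 0.1$ and $\mathrm{Var}(Y_t) = 1$ give a bound of $1.1$; only the first term can be tightened through the choice of $\mathcal{C}$.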

\begin{figure}
\centering
\includegraphics[width=0.33\textwidth]{UAI2025/figures/linear_model2.png}
\caption{Comparison of the factual learner and PB model with the true conditional potential outcomes.}
\label{Fig:deconfounding_linear_b}
\end{figure}



% This deviation is controlled by two terms: \( L_p(f_t) \), which captures the discrepancy between the predicted potential outcomes \( Z_t \) and the RCT outcomes \( Y'_t \), and $\sqrt{\mathrm{Var}(Y_t)}$ represents the standard deviation of the potential outcomes. 
%\sup_{g\in \mathcal{G}} | \cov(Y_t, g(X))|$, which represents the maximal covariance between the true potential outcomes and the transformation of the covariates under $\mathcal{G}$. 

\paragraph{Empirical Illustration.} Figure~\ref{Fig:deconfounding_linear_b} illustrates the performance of this model on the synthetic linear example in Section~\ref{example:factual_learner_bias}. We observe that the gap between the true conditional potential outcomes and the predicted potential outcomes is reduced compared to the factual learner.

% A key limitation of practical projections balancing is that it does not guarantee recovery of the true conditional potential outcomes.

\subsection{Algorithm: Marginals + Projections Balancing}\label{sec:algo}
In this section, we present our proposed approach, MB+PB, which combines Marginals Balancing and Projections Balancing. The rationale is that the two constraints together restrict the search space of the factual optimization objective and push the solution as close as possible to the true conditional potential outcomes. 

\paragraph{Optimization Objective.} The objective function for MB+PB is the following:
\begin{equation*}
    \min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} \left(\mathbb{E}\left[\left(Z_T - Y\right)^2\right] + \alpha \sum_{t=0}^1 \mathcal{L}_t(f_t)\right),
\end{equation*}
where
\begin{equation}\label{eqn:reg}
\begin{aligned}
    \mathcal{L}_{t}\left(f_t\right) & = \sup_{g \in \mathcal{G}} \Big| \mathbb{E}\left[f_t\left(X,\Tilde{U}\right)g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right]\Big|\\ 
    & + \sup_{\tilde{g}\in \mathcal{B}(\mathbb{R})}\Big| \mathbb{E}\left[\tilde{g}\left(f_t(X,\Tilde{U})\right)\right] - \mathbb{E}\left[\tilde{g}(Y'_t )\right]\Big|
\end{aligned}
\end{equation}
and  
$Z_t = \mathbb{E}\left[f_t\left(X,\Tilde{U}\right)| X \right]$
for some function $f_t$ and a random variable $\Tilde{U}$.

\paragraph{Empirical Illustration.} Figure~\ref{Fig:deconfounding_linear_c} illustrates the performance of this model on the case study in Section~\ref{example:factual_learner_bias}. We observe that the gap between the true conditional potential outcomes and the predicted potential outcomes is almost entirely closed. Compared with applying MB and PB individually, as in Figures~\ref{Fig:deconfounding_linear_a} and~\ref{Fig:deconfounding_linear_b}, MB+PB performs markedly better. Motivated by this, we opt for MB+PB as our final approach. 

\paragraph{Training.} We now present the general procedure to train MB+PB for a general class of functions. For the full pseudo-code, see Algorithm~\ref{alg:mb_pb}.
\begin{figure}
\centering
\includegraphics[width=0.33\textwidth]{UAI2025/figures/linear_model3.png}
\caption{Comparison of the factual learner and MB+PB model with the true conditional potential outcomes.}
\label{Fig:deconfounding_linear_c}
\end{figure}

\begin{enumerate}
    \item Pseudo-Confounder Generation. We generate Gaussian noise $\eta \in \mathbb{R}^l \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)$, where $l$ is the dimension of the generated noise. The noise is passed through a neural network generator $\psi$, and we set $\Tilde{U} = \psi\left(\eta\right)$. 
    \item Potential Outcomes Estimation. Both the features $X$ and the generated pseudo-confounder $\Tilde{U}$ are fed into a neural network-based conditional potential outcomes learner $f_t$ to obtain the predicted potential outcome $f_t(X,\Tilde{U})$. 
    % e.g., TARNet~\citep{shalit}. 
    \item Balancing. Meanwhile, the predicted potential outcomes $f_1(X, \Tilde{U})$ and $f_0(X, \Tilde{U})$ are balanced with the RCT outcomes $Y'_1$ and $Y'_0$, respectively, through the regularization defined in Equation~(\ref{eqn:reg}).
\end{enumerate}
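
Because $Z_t$ is a conditional expectation over the pseudo-confounder, evaluating it at a point $x$ requires integrating out the noise. A minimal sketch, assuming that $\eta$ is drawn independently of $X$ and using a hypothetical number $K$ of noise draws (a choice not specified by the steps above), is the Monte Carlo average
\begin{equation*}
Z_t(x) = \mathbb{E}\left[f_t\left(x, \psi(\eta)\right)\right] \approx \frac{1}{K} \sum_{k=1}^{K} f_t\left(x, \psi(\eta_k)\right), \qquad \eta_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right).
\end{equation*}
Under this sketch, the predicted CATE at $x$ is the difference $Z_1(x) - Z_0(x)$.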



% While the proposed constrained optimization is complex, we leave open the conjecture of whether minimizing the PB+MB loss recovers the true conditional potential outcomes for Gaussian linear functions. Furthermore, how much does this method reduce the gap between predicted and true outcomes for general functions? These questions are left for future research.

\section{Empirical Results}
\begin{figure*}[t]
    \centering
    \begin{subfigure}{0.3\linewidth}
        \centering
        \includegraphics[width=0.95\linewidth]{UAI2025/figures/rct_size_25.png}
        \caption{}
        \label{fig:rct_25}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\linewidth}
        \centering
        \includegraphics[width=0.95\linewidth]{UAI2025/figures/rct_size_50.png}
        \caption{}
        \label{fig:rct_50}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.3\linewidth}
        \centering
        \includegraphics[width=0.95\linewidth]{UAI2025/figures/rct_size_100.png}
        \caption{}
        \label{fig:rct_100}
    \end{subfigure}
    \caption{Comparison of $\sqrt{\varepsilon_{\text{PEHE}}}$ across different RCT and observational data sample sizes. Baseline: Factual Learner, MB+PB: Combined Marginals and Projections Balancing, and RCT-Oracle. The size of the baseline and RCT-Oracle is equal to the sum of the RCT samples and the observational data size.}
    \label{fig:size}
\end{figure*}
\label{sec:empirical_results}
%In this section, we present experiments to evaluate the performance of MB+PB}} on synthetic and real-world datasets.
\subsection{Synthetic Experiments}
\label{subsec:synthetic_experiments}

Following~\cite{kallus2019interval}, we begin our empirical evaluation with a synthetic example, which allows us to control the confounding degree based on a parameter $\Gamma$ of MSM (defined in Section~\ref{def:msm}) and \emph{explore the effect of varying levels of hidden confounding} on the estimation of CATE. 

\paragraph{Data Generating Process.} We consider a one-dimensional example to illustrate the influence of unobserved confounding on estimating CATE. In this example, we generate an unobserved binary confounder \( U \sim \text{Bern}(1/2) \), which is independent of other variables, and a covariate \( X \sim \text{Unif}[-2, 2] \). The nominal propensity score is defined as \( e(x) = \sigma(0.75x + 0.5) \), where \( \sigma(\cdot) \) is the logistic sigmoid function. To investigate the impact of confounding, we consider a sensitivity parameter \( \Gamma \) and define the complete propensity score as:
\begin{equation}
e(x, u) = \frac{u}{\alpha_t(x; \Gamma)} + \frac{1 - u}{\beta_t(x; \Gamma)},
\end{equation}
with 
$
\alpha_t(x; \Gamma) = \frac{1}{\Gamma \cdot e(x)} + 1 - \frac{1}{\Gamma}
$
and
$
\beta_t(x; \Gamma) = \frac{\Gamma}{e(x)} + 1 - \Gamma.
$
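
As a quick sanity check of this construction (a sketch that reads $\alpha_t$ and $\beta_t$ as the values of the inverse complete propensity $1/e(x,u)$ at $u = 1$ and $u = 0$, as in \citet{kallus2019interval}, and that assumes the usual odds-ratio form of the MSM), subtracting one from each quantity gives
\[
\frac{1 - e(x,1)}{e(x,1)} = \alpha_t(x;\Gamma) - 1 = \frac{1}{\Gamma} \cdot \frac{1 - e(x)}{e(x)}, \qquad
\frac{1 - e(x,0)}{e(x,0)} = \beta_t(x;\Gamma) - 1 = \Gamma \cdot \frac{1 - e(x)}{e(x)}.
\]
Hence the odds ratio between the complete and the nominal propensity equals $\Gamma$ for $u = 1$ and $1/\Gamma$ for $u = 0$, that is, the two endpoints of the interval $[1/\Gamma, \Gamma]$. Both $\alpha_t$ and $\beta_t$ are at least one whenever $\Gamma \geq 1$, so their reciprocals are valid probabilities.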

Moreover, the treatment assignment \( T \) is sampled as \( T \sim \text{Bern}(e(X, U)) \). This structure ensures that the complete propensity scores attain the extremal marginal sensitivity model (MSM) bounds corresponding to \( \Gamma \) \citep[see][for more details]{kallus2019interval}. 
The outcome model is chosen to exhibit a nonlinear CATE, incorporating both linear confounding terms and a noise component \( \varepsilon \sim \mathcal{N}(0, 1) \). Specifically, the potential outcome \( Y_t \) is  defined as:
$$
\begin{aligned}
Y_t & = (2t - 1)X + 2 (2t - 1) - 2 \sin(2(2t - 1)X) \\
&\quad  - 2(2U - 1)(1 + 0.5X) + \varepsilon.
\end{aligned}
$$
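For reference, the ground-truth targets implied by this outcome model can be written in closed form. This short derivation assumes that $\varepsilon$ has mean zero given $X$ and uses the symbol $\tau$ for the CATE, which is introduced here only for illustration. Since $U$ is independent of $X$ with $\mathbb{E}[2U - 1] = 0$, the confounding term averages out and
\[
\mathbb{E}\left[Y_t \mid X = x\right] = (2t - 1)x + 2(2t - 1) - 2 \sin\left(2(2t - 1)x\right),
\]
so that, using $\sin(-2x) = -\sin(2x)$,
\[
\tau(x) = \mathbb{E}\left[Y_1 - Y_0 \mid X = x\right] = 2x + 4 - 4\sin(2x).
\]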
\begin{figure}
\centering
\includegraphics[width=0.4\textwidth]{./figures/epehe_vs_confounding.png}
\caption{$\sqrt{\varepsilon_{\text{PEHE}}}$ for different confounding degrees. Baseline: Factual Learner, MB: Marginals Balancing, PB: Projections Balancing, MB+PB: Combined Marginals and Projections Balancing, RCT-Oracle: Using a large RCT dataset with covariates, and Obs-Oracle: Using the observational dataset without hidden confounders.}
\label{Fig:conf_degree}
\end{figure}
\paragraph{Results.} The results are illustrated in Figure~\ref{Fig:conf_degree}. In particular, with increasing confounding level measured by $\log(\Gamma)$, methods such as MB, PB, and the baseline show a marked increase in estimation error. However, MB+PB demonstrates strong robustness and maintains lower errors even at high confounding levels. This suggests that our approach is better equipped to handle the adverse effects of hidden confounders, which is crucial when the confounding degree is unknown. Notably, domain knowledge can only provide very coarse estimations of the confounding degree. %Thus, having a model that remains stable despite such uncertainty is particularly appealing in practical applications.  

\paragraph{Influence of RCT Data Size.} In Figure~\ref{fig:size}, we observe that the performance of MB+PB stabilizes once as few as $50$ RCT data points are used in addition to more than $1000$ observational data points. This shows that our model requires only a small number of RCT points to achieve enhanced performance, without requiring the covariate information of the RCT data. 
Even with as few as $25$ data points (the sum of both control and treatment units), we can see improved performance over the biased factual learner. It is important to note that this improvement is not observed when RCT points are simply added to the observational data, even when their features are included in training. 



\subsection{Real Data Application}
Following the setting of~\cite{hatt2022combining},
we apply MB+PB to three real-world datasets. We briefly describe them below, with more details deferred to Appendix~\ref{sec:original_data}.

\textbf{STAR}: A randomized study from $1985$ investigating the effect of class size (treatment) on students' standardized test scores (outcome). Following~\cite{kallus2018removing}, we obtain a dataset with 8 covariates for $4{,}139$ students: $1{,}774$ in small classes and $2{,}365$ in regular classes.

\textbf{ACTG}: A clinical trial on the effects of different treatments for HIV-1 patients with CD4 counts of $200$--$500$ cells/mm$^3$. The outcome is the change in CD4 counts after $20 \pm 5$ weeks. 

\textbf{NSW}: An RCT studying the effect of job training on income~\citep{lalonde1986evaluating}. Following~\cite{smith2005does}, we combine 465 randomized subjects (297 treated, 425 control) with 2,490 observational controls, including 8 covariates.

Following the setting in~\cite{hatt2022combining}, the original dataset is used to estimate pseudo-true potential outcomes, which we treat as the ground truth. Confounding bias is introduced by dropping instances based on outcome thresholds. Further details are in Appendix~\ref{sec:new_data}. The RCT data points are sampled from a distributionally different population from the observational population, increasing selection bias. Despite this, our method remains robust.

\begin{table}[htbp]
\centering
\caption{Comparison of $\sqrt{\varepsilon_{\text{PEHE}}}$ across three real-world datasets. Results are reported as mean $\pm$ standard deviation over $10$ runs.}
\label{tab:real_world_results}

\begin{tabular}{|l|c|c|c|}
\hline
 & \multicolumn{3}{c|}{$\sqrt{\varepsilon_{\text{PEHE}}}$ (Mean $\pm$ Std)} \\
\hline
 \textbf{Estimator} & \textbf{STAR} & \textbf{ACTG} & \textbf{NSW} \\
\hline
2-step ridge & 3.01 $\pm$ 0.01 & 1.51 $\pm$ 0.01 & 2.82 $\pm$ 0.02 \\
2-step RF & 3.14 $\pm$ 0.03 & 1.58 $\pm$ 0.07 & 3.10 $\pm$ 0.12 \\
2-step NN & 3.03 $\pm$ 0.02 & 1.60 $\pm$ 0.02 & 2.82 $\pm$ 0.02 \\
Baseline & 2.66 $\pm$ 0.01 & 1.08 $\pm$ 0.04 & 0.85 $\pm$ 0.04 \\
CorNet & 0.59 $\pm$ 0.01 & 0.42 $\pm$ 0.06 & 0.14 $\pm$ 0.07 \\
CorNet$^+$ & 0.38 $\pm$ 0.07 & \textbf{0.27} $\pm$ 0.03 & 0.21 $\pm$ 0.08 \\
MB+PB & \textbf{0.36} $\pm$ 0.04 & 0.52 $\pm$ 0.05 & \textbf{0.08} $\pm$ 0.02 \\
\hline
\end{tabular}
\end{table}


\textbf{Results.} To assess the effectiveness of our approach in utilizing RCT data, we compare it with the factual learner (\emph{Baseline}), which trains only on observational data, and with methods that use covariate information from RCT data, including \emph{2-step ridge}, \emph{2-step RF}, and \emph{2-step NN} from~\cite{kallus2018removing}, and the CorNet models (\emph{CorNet} and \emph{CorNet$^+$}) developed by~\cite{hatt2022combining}. Table~\ref{tab:real_world_results} shows that \emph{2-step ridge}, \emph{2-step RF}, and \emph{2-step NN} underperform due to the high variance introduced by inverse propensity score re-weighting, as noted in~\cite{hatt2022combining}. The CorNet models perform significantly better and are comparable to our approach MB+PB. We emphasize that MB+PB relies solely on the outcomes of the RCT data yet still achieves competitive results, attaining the lowest error on two of the three datasets (STAR and NSW).

% demonstrating that MB+PB can provide robust performance with less data.

\section{Conclusion}
In this work, we introduced two approaches, Marginals Balancing (MB) and Projections Balancing (PB), to address the challenge of CATE estimation under hidden confounders. By leveraging outcome-only RCT data, we demonstrated how these models mitigate bias from unobserved confounders and compare favorably with benchmark methods. The combination of MB and PB (MB+PB) further improves performance across synthetic and real-world datasets. While our methods show promising empirical results, we aim to pursue a deeper theoretical understanding of them in future work.

\acknowledgements{Ahmed Aloui, Juncheng Dong, and Vahid Tarokh were supported in part by the National Science Foundation (NSF) under the National AI Institute for Edge Computing Leveraging Next Generation Wireless Networks Grant \#  2112562. We also thank Prof. Galen Reeves for valuable discussions.}

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.5\linewidth]{figures/nsf_logo.png}
\end{figure}

% \paragraph{Ethics Statement.}
% This work focuses on improving the design of machine learning models for estimating treatment effects. We do not foresee any immediate ethical concerns.

% \paragraph{Reproducibility Statement.}
% We have provided detailed information on how the datasets are processed and how the models are trained, including hyperparameters values. Additionally, we have included the implementation of our algorithms in Python in the supplementary material.


\bibliography{uai2025-template.bib}

\newpage

\onecolumn

\title{Conditional Average Treatment Effect Estimation Under Hidden Confounders\\(Supplementary Material)}
\maketitle

\appendix


\section{Proofs of Theoretical Results}
\label{proofs}
We begin by presenting an example demonstrating that the optimal solution for the Marginals Balancing objective is not necessarily the true conditional potential outcomes. We then proceed to provide propositions that support the use of the Projections Balancing method.
\paragraph{Example}
Consider the random variables \(T, X, Y_0, Y_1\), where \(T\) is a binary treatment indicator, \(X \in \mathcal{X}\), and \(Y_0, Y_1\) are the potential outcomes. We aim to minimize the following MB objective:

\[
\mathbb{E}\left[\left(1-T\right) \left(\mathbb{E}[\tilde{Y}_0 \mid X] - Y_0 \right)^2  + T \left( \mathbb{E}[\tilde{Y}_1 \mid X] - Y_1 \right)^2 \right],
\]
subject to the constraint that \( \tilde{Y}_0 \overset{d}{=} Y_0 \) and \( \tilde{Y}_1 \overset{d}{=} Y_1 \).

Suppose \(X \sim \text{Ber}(1/2)\) and \(T \sim \text{Ber}(1/2)\), with \(T\) and \(X\) being independent. Define the potential outcomes as:
\[
Y_0 = Y_1 = (1 - T)X + T(1 - X).
\]

Now, consider the random variables \(\tilde{Y}_0 = X\) and \(\tilde{Y}_1 = 1 - X\). We observe that both \(\tilde{Y}_0\) and \(\tilde{Y}_1\) satisfy the equality in distribution constraint: \(\tilde{Y}_0 \overset{d}{=} Y_0\) and  \(\tilde{Y}_1 \overset{d}{=} Y_1\). 

Furthermore, we have:
\[
\mathbb{E}[\tilde{Y}_0 \mid X] (1-T) = X (1-T) = Y_0 (1-T),
\]
and
\[
\mathbb{E}[\tilde{Y}_1 \mid X] T = (1 - X) T = Y_1 T .
\]

Therefore, the MB objective attains its minimum value of zero. In contrast, for the true conditional potential outcomes $\mathbb{E}\left[Y_1|X\right]$ and $\mathbb{E}\left[Y_0|X\right]$, the independence of $T$ and $X$ gives:
\[
\begin{aligned}
\mathbb{E}\left[Y_1|X\right] 
& = \mathbb{E}\left[(1 - T)X \mid  X \right] + \mathbb{E}\left[T(1 - X) \mid X \right] \\
& = \mathbb{E}\left[1-T\right] \mathbb{E}\left[X \mid X\right] + \mathbb{E}\left[T\right] \mathbb{E}\left[\left(1-X\right) \mid X\right] \\
& = \frac{1}{2} X + \frac{1}{2} \left(1-X\right).
\end{aligned}
\]
Therefore, 
\[
\mathbb{E}\left[Y_1|X\right] = \frac{1}{2}, \quad \quad \mathbb{E}\left[Y_0|X\right] = \frac{1}{2},
\]
which does not achieve zero loss for the objective.
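Concretely, since $Y_0 = Y_1$ takes values in $\{0, 1\}$, we have $\left(\tfrac{1}{2} - Y_t\right)^2 = \tfrac{1}{4}$ almost surely, and plugging the true conditional potential outcomes into the objective gives
\[
\mathbb{E}\left[\left(1-T\right) \left(\tfrac{1}{2} - Y_0 \right)^2  + T \left( \tfrac{1}{2} - Y_1 \right)^2 \right] = \mathbb{E}\left[\left(1-T\right) \cdot \tfrac{1}{4} + T \cdot \tfrac{1}{4}\right] = \frac{1}{4} > 0.
\]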

\begin{proposition1}[Ideal Potential outcomes learner 2]
Let $\left(\Omega, \mathcal{F}, \mathbb{P}\right)$ be a probability space. Consider the real random variables $\left(X, U, T, Y_0, Y_1\right)$, where $T$ is a binary random variable, $Y_1,Y_0 \ind T |\left(X,U\right)$, and $Y$ is defined as $Y = T Y_1 + (1 - T) Y_0$. We also assume that $X\ind U$. We aim to solve the following optimization problem:
$$
\min_{Z_1, Z_0 \; \sigma(X)\text{-measurable}} \mathbb{E}\left[\left(Z_T - Y\right)^2\right]
$$
subject to the constraint
$$
\forall g: \mathbb{R}\to [-1,1], \forall t \in \{0, 1\}, \quad \mathbb{E}\left[Z_t g(X)\right] = \mathbb{E}\left[Y_tg(X)\right].
$$
The unique solution for this problem is
$$
\forall t \in \{0, 1\}, \quad Z_t = \mathbb{E}\left[Y_t \mid X\right].
$$

\end{proposition1}

\begin{proof}[Proof of Proposition~\ref{thm:ideal_pb}]$\;$\\
We begin with the following identities for the observed and predicted outcomes:
\[
Y = T Y_1 + (1 - T) Y_0, \quad Z_T = T Z_1 + (1 - T) Z_0.
\]
Thus, the objective function can be expanded as:
\[
\begin{aligned}
\mathbb{E}\left[\left(Z_T - Y\right)^2\right] 
& = \mathbb{E}\left[\left(T(Z_1 - Y_1) + (1-T)(Z_0 - Y_0)\right)^2\right] \\
& = \mathbb{E}\left[T(Z_1 - Y_1)^2 + (1-T)(Z_0 - Y_0)^2\right] \\
& \quad + 2 \mathbb{E}\left[T (1-T) (Z_1 - Y_1)(Z_0 - Y_0)\right].
\end{aligned}
\]
Since \( T \in \{0,1\} \), we have \( T(1 - T) = 0 \), so the cross term vanishes:
\[
\mathbb{E}\left[T(1-T)(Z_1 - Y_1)(Z_0 - Y_0)\right] = 0.
\]
Thus, the objective simplifies to:
\[
\mathbb{E}\left[\left(Z_T - Y\right)^2\right] = \mathbb{E}\left[T(Z_1 - Y_1)^2\right] + \mathbb{E}\left[(1-T)(Z_0 - Y_0)^2\right].
\]
Next, we can analyze the optimization for \( Z_1 \) and \( Z_0 \) separately. Without loss of generality, we first focus on \( Z_1 \).

We expand the term for \( Z_1 \):
\begin{align*}
\mathbb{E}[T(Z_1 - Y_1)^2] 
& = \mathbb{E}[T(Z_1 - \mathbb{E}[Y_1 \mid X] + \mathbb{E}[Y_1 \mid X] - Y_1)^2]  \\
& = \underbrace{\mathbb{E}\left[T(Z_1 - \mathbb{E}[Y_1 \mid X])^2\right]}_{\text{Minimized at zero when } Z_1 = \mathbb{E}[Y_1 \mid X]} + \underbrace{\mathbb{E}\left[T(\mathbb{E}[Y_1 \mid X] - Y_1)^2\right]}_{\text{Does not depend on } Z_1}  \\
& \quad + \underbrace{2 \mathbb{E}\left[T (Z_1 - \mathbb{E}[Y_1 \mid X]) (\mathbb{E}[Y_1 \mid X] - Y_1)\right]}_{\text{We prove this term is zero below}} 
\end{align*}

Since \( Y_1 \ind T \mid (X, U) \), we have:
\[
\mathbb{E}\left[T (Z_1 - \mathbb{E}[Y_1 \mid X]) (\mathbb{E}[Y_1 \mid X] - Y_1)\right] = \mathbb{E}\left[(Z_1 - \mathbb{E}[Y_1 \mid X])\pi(X, U) \Psi(U) \right],
\]
where \( \pi(X, U) = \mathbb{E}[T \mid X, U] \in (0, 1) \) and \( \Psi(U) = -\mathbb{E}[Y_1 \mid U] \). Let \( A = \{ \omega \mid Z_1 - \mathbb{E}[Y_1 \mid X] > 0 \} \) and \( B = \{ \omega \mid \Psi(U) > 0 \} \).

We decompose the expectation as follows:
\[
\begin{aligned}
\mathbb{E}\left[\pi(X, U) \Psi(U) (Z_1 - \mathbb{E}[Y_1 \mid X])\right] 
& = \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \\
& \quad + \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A^C \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \\
& \quad + \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A \cap B^C} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \\
& \quad + \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A^C \cap B^C} (Z_1 - \mathbb{E}[Y_1 \mid X])\right]
\end{aligned}
\]

We now handle each of these four terms separately:

Case 1 \(\left( A \cap B \right)\):

This term is non-negative, as both \( Z_1 - \mathbb{E}[Y_1 \mid X] > 0 \) and \( \Psi(U) > 0 \), and since $X\ind U$, we have:
\[
\begin{aligned}
0 \leq \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] 
& \leq \mathbb{E}\left[\Psi(U) \mathbbm{1}_{A \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \\
& \leq \mathbb{E}\left[\Psi(U) \mathbbm{1}_B\right] \mathbb{E}\left[(Z_1 - \mathbb{E}[Y_1 \mid X])\mathbbm{1}_{A}\right] \\
& \leq \mathbb{E}\left[\Psi(U) \mathbbm{1}_B\right] (\mathbb{E}\left[Z_1 \mathbbm{1}_{A}\right] -\mathbb{E}\left[\mathbb{E}[Y_1 \mathbbm{1}_{A}\mid X]\right]) \\
& \leq \mathbb{E}\left[\Psi(U) \mathbbm{1}_B\right] \left(\mathbb{E}\left[Z_1\mathbbm{1}_{A}\right] -\mathbb{E}[Y_1 \mathbbm{1}_{A}]\right)
\end{aligned}
\]
However, since \( \mathbbm{1}_{A} \) is \( \sigma(X) \)-measurable, it can be written as a function of $X$: there exists a measurable $g_A$ with values in $\{0,1\} \subset [-1,1]$ such that $g_A(X) = \mathbbm{1}_{A}$. Applying the constraint with $g = g_A$ gives
\begin{align*}
0 & \leq \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \\
& \leq \mathbb{E}\left[\Psi(U) \mathbbm{1}_B\right] \left(\mathbb{E}\left[Z_1g_{A}(X)\right] -\mathbb{E}[Y_1 g_{A}(X)]\right) 
 = 0.
\end{align*}


Case 2 \( \left(A^C \cap B \right)\):

In this case, \( Z_1 - \mathbb{E}[Y_1 \mid X] \leq 0 \) and \( \Psi(U) > 0 \), making this term non-positive:
\[
0 \geq \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A^C \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \geq \mathbb{E}\left[\Psi(U) \mathbbm{1}_{A^C \cap B} (Z_1 - \mathbb{E}[Y_1 \mid X])\right].
\]
Again, by the same reasoning as in Case 1, we have:
\[
\mathbb{E}\left[\mathbb{E}\left[(Z_1 \mathbbm{1}_{A^C} - Y_1 \mathbbm{1}_{A^C}) \mid X \right]\right] = 0,
\]
so this term is also zero.

Case 3 \( \left(A \cap B^C \right)\):

Here, \( Z_1 - \mathbb{E}[Y_1 \mid X] > 0 \) but \( \Psi(U) \leq 0 \), so this term is non-positive:
\[
0 \geq \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A \cap B^C} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \geq \mathbb{E}\left[\Psi(U) \mathbbm{1}_{A \cap B^C} (Z_1 - \mathbb{E}[Y_1 \mid X])\right].
\]
As in the previous cases, the constraint applied with $g_A$ gives \( \mathbb{E}\left[\mathbb{E}\left[(Z_1 \mathbbm{1}_{A} - Y_1 \mathbbm{1}_{A}) \mid X \right]\right] = 0 \), so this term is zero.

Case 4 \( \left(A^C \cap B^C \right)\) :

Finally, in this case, both \( Z_1 - \mathbb{E}[Y_1 \mid X] \leq 0 \) and \( \Psi(U) \leq 0 \), so the term is non-negative:
\[
0 \leq \mathbb{E}\left[\pi(X, U) \Psi(U) \mathbbm{1}_{A^C \cap B^C} (Z_1 - \mathbb{E}[Y_1 \mid X])\right] \leq \mathbb{E}\left[\Psi(U) \mathbbm{1}_{A^C \cap B^C} (Z_1 - \mathbb{E}[Y_1 \mid X])\right].
\]
Once again, we apply the same reasoning, and the term equals zero:
\[
\mathbb{E}\left[\mathbb{E}\left[(Z_1 \mathbbm{1}_{A^C} - Y_1 \mathbbm{1}_{A^C}) \mid X \right]\right] = 0.
\]

Thus, each of the four terms is equal to zero. Therefore, the entire expression simplifies to zero:
\[
2 \mathbb{E}\left[T (Z_1 - \mathbb{E}[Y_1 \mid X]) (\mathbb{E}[Y_1 \mid X] - Y_1)\right] = 0.
\]

A symmetric argument holds for \( Z_0 \). By expanding \( \mathbb{E}\left[(1 - T)(Z_0 - Y_0)^2\right] \), we can use the same reasoning to show that \( Z_0 = \mathbb{E}[Y_0 \mid X] \) minimizes the objective function.

We now observe that \( Z_1 = \mathbb{E}[Y_1 \mid X] \) and \( Z_0 = \mathbb{E}[Y_0 \mid X] \) satisfy the constraint, since for every $g: \mathbb{R}\to [-1,1]$ and every $t \in \{0,1\}$:
\[
\begin{aligned}
\mathbb{E}\left[\mathbb{E}\left[Y_t \mid X\right] g(X) \right] 
& = \mathbb{E}\left[\mathbb{E}\left[Y_t g(X) \mid X\right]  \right] \\
& = \mathbb{E}\left[Y_t g(X)\right].
\end{aligned}
\]

Combining these results, we conclude the minimizer of the objective function must satisfy:
\[
Z_1 = \mathbb{E}[Y_1 \mid X] \quad \text{and} \quad Z_0 = \mathbb{E}[Y_0 \mid X].
\]

\end{proof}
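
As a complementary check on uniqueness (a short sketch that is not part of the argument above, and that assumes $Z_t$ and $Y_t$ are integrable), note that the constraint alone already identifies $Z_t$ among $\sigma(X)$-measurable candidates. Write $D_t = Z_t - \mathbb{E}[Y_t \mid X]$, which is $\sigma(X)$-measurable, so $D_t = d_t(X)$ for some measurable function $d_t$ (a symbol introduced here for illustration). Choosing $g = \operatorname{sign}(d_t)$, which takes values in $\{-1, 0, 1\} \subset [-1,1]$, the constraint yields
\[
\begin{aligned}
0 = \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y_t g(X)\right]
& = \mathbb{E}\left[\left(Z_t - \mathbb{E}[Y_t \mid X]\right) g(X)\right] \\
& = \mathbb{E}\left[\left|Z_t - \mathbb{E}[Y_t \mid X]\right|\right],
\end{aligned}
\]
so that $Z_t = \mathbb{E}[Y_t \mid X]$ almost surely.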

\begin{proposition2}[Relaxed potential outcomes learner (PB)]
Let $\mathcal{G} = \{g: \mathbb{R}^d \rightarrow [-1,1]\}$ and let
\[
L_p(Z_t) = \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right] \right|
\]
with \( Y'_t \overset{d}{=} Y_t \) and \( Y'_t \ind Y_t \). 
Then,
\[
\mathbb{E}\left[|Z_t - \mathbb{E}[Y_t \mid X]|\right] \leq L_p(Z_t) + \sqrt{\mathrm{Var}(Y_t)}.
\]
\end{proposition2}

\begin{proof}[Proof of Proposition~\ref{thm:practical_pb}]$\;$\\
First define
\[
L_I(Z_t) = \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y_t g(X)\right] \right|.
\]
We will first prove that
\[
\mathbb{E}\left[|Z_t - \mathbb{E}[Y_t \mid X]|\right] \leq L_I(Z_t).
\]
Since \( Z_t - \mathbb{E}[Y_t \mid X] \) is \(\sigma(X)\)-measurable, so are the events \( A = \{\omega \in \Omega \mid Z_t - \mathbb{E}[Y_t \mid X] > 0\} \) and \( B = \{\omega \in \Omega \mid Z_t - \mathbb{E}[Y_t \mid X] \leq 0\} \). We can therefore define a function \( \Tilde{g} \in \mathcal{G} \) such that \( \Tilde{g}(X) = \mathbbm{1}_A - \mathbbm{1}_B \). We have:
\[
\begin{aligned}
    \left| \mathbb{E}\left[Z_t \Tilde{g}(X)\right] - \mathbb{E}\left[Y_t \Tilde{g}(X)\right] \right| 
    &= \left| \mathbb{E}\left[(Z_t - Y_t) \Tilde{g}(X)\right] \right| \\
    &= \left| \mathbb{E}\left[\mathbb{E}\left[\left(Z_t - Y_t\right) \Tilde{g}(X) \mid X\right]\right] \right| \\
    &= \left| \mathbb{E}\left[\mathbb{E}\left[(Z_t - Y_t) \mid X\right] \Tilde{g}(X)\right] \right| \\
    &= \mathbb{E}\left[\left| \mathbb{E}\left[Z_t - Y_t \mid X\right] \mathbbm{1}_A \right|\right] + \mathbb{E}\left[\left| \mathbb{E}\left[Z_t - Y_t \mid X\right] \mathbbm{1}_B \right|\right]  \quad (A \cup B = \Omega) \\ 
    &= \mathbb{E}\left[\left| Z_t - \mathbb{E}[Y_t \mid X] \right|\right].
\end{aligned}
\]
Since we have
\[
\left| \mathbb{E}\left[Z_t \Tilde{g}(X)\right] - \mathbb{E}\left[Y_t \Tilde{g}(X)\right] \right| \leq \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y_t g(X)\right] \right|,
\]
it follows that
\[
\mathbb{E}\left[\left| Z_t - \mathbb{E}[Y_t \mid X] \right|\right] \leq L_I(Z_t).
\]
Next, we observe:
\[
\begin{aligned}
    L_I(Z_t) &= \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right] + \mathbb{E}\left[Y'_t g(X)\right] - \mathbb{E}\left[Y_t g(X)\right] \right| \\
    &\leq \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Z_t g(X)\right] - \mathbb{E}\left[Y'_t g(X)\right] \right| + \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Y'_t g(X)\right] - \mathbb{E}\left[Y_t g(X)\right] \right| \\
    &\leq L_p(Z_t) + \sup_{g\in \mathcal{G}} \left| \mathbb{E}\left[Y'_t\right] \mathbb{E}\left[g(X)\right] - \mathbb{E}\left[Y_t g(X)\right] \right| \\
    &= L_p(Z_t) + \sup_{g\in \mathcal{G}} \left| \mathbb{E}[Y_t] \mathbb{E}[g(X)] - \mathbb{E}[Y_t g(X)] \right| \\
    &= L_p(Z_t) + \sup_{g\in \mathcal{G}} \left| \cov(Y_t, g(X)) \right| \\
    & \leq L_p(Z_t) + \sqrt{\mathrm{Var}(Y_t)} \sup_{g\in \mathcal{G}}\sqrt{\mathrm{Var}(g(X))} \quad \text{(Cauchy-Schwarz)} \\
    &\leq  L_p(Z_t) + \sqrt{\mathrm{Var}(Y_t)} \quad \text{(Popoviciu's inequality)}
\end{aligned}
\]
Thus, we conclude:
\[
\mathbb{E}\left[\left| Z_t - \mathbb{E}[Y_t \mid X] \right|\right] \leq L_p(Z_t) + \sqrt{\mathrm{Var}(Y_t)}.
\]
\end{proof}
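
For reference, the last step of the proof relies on Popoviciu's inequality, which states that any random variable $W$ taking values in an interval $[a,b]$ satisfies $\mathrm{Var}(W) \leq (b-a)^2/4$. Applied to $W = g(X)$ with $g \in \mathcal{G}$, so that $a = -1$ and $b = 1$, this gives
\[
\mathrm{Var}(g(X)) \leq \frac{\left(1 - (-1)\right)^2}{4} = 1,
\quad \text{hence} \quad
\sup_{g\in \mathcal{G}} \sqrt{\mathrm{Var}(g(X))} \leq 1.
\]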

\section{Datasets Description}

\subsection{The Original Datasets}
\label{sec:original_data}
\paragraph{Tennessee Student/Teacher Achievement Ratio (STAR) Experiment}
This experiment, initiated in 1985, was designed as a randomized trial to investigate the impact of class size (i.e., the treatment) on students' standardized test performance (i.e., the outcome). At the beginning of the study, students and teachers were randomly allocated to different class sizes, with efforts to maintain these class sizes throughout the experiment. This dataset has been used previously by \citet{kallus2018removing} to address bias from unmeasured confounding in observational studies. 

In line with \citet{kallus2018removing}, we focus on two treatment conditions: small classes (\(13\)--\(17\) students) and regular-sized classes ($22$--$25$ students). The treatment variable is the class size to which students were assigned in the first grade, comprising a total of \(4{,}509\) students. The outcome variable \( Y \) is measured as the aggregate score from listening, reading, and mathematics standardized tests administered at the end of the first grade. In addition to class size and test scores, the dataset includes several covariates for each student: gender, race, birth month, birth date, birth year, eligibility for free lunch, rural/urban status, and teacher identification number. After excluding students with incomplete data, the resulting sample consists of \(4{,}139\) students, with \(1{,}774\) assigned to the treatment group (small classes, \( T = 1 \)) and \(2{,}365\) to the control group (regular classes, \( T = 0 \)).

\paragraph{AIDS Clinical Trial Group (ACTG) Study 175}
The AIDS Clinical Trial Group (ACTG) Study $175$ was a randomized clinical trial conducted to compare four treatment regimens on $2{,}139$ HIV-1-infected patients with CD4 counts between $200$ and $500$ cells/mm\(^3\) \citep{hammer1996}. The trial compared the effectiveness of zidovudine (ZDV) monotherapy, didanosine (ddI) monotherapy, ZDV combined with ddI, and ZDV combined with zalcitabine (ZAL). This dataset was also used in \citet{hatt2022generalizing} to study the problem of learning policies that generalize to target populations, making it a challenging candidate for evaluating our method due to underrepresentation of certain subgroups, such as HIV-positive females, in clinical trials \citep{gandhi2005eligibility, greenblatt2011priority}.

The outcome \( Y \) in this dataset is defined as the change in CD4 count from the start of the study to $20 \pm 5$ weeks later. The estimated average treatment effects for male and female subgroups are \(-8.97\) and \(-1.39\), respectively \citep{hatt2022generalizing}, indicating a notable difference in treatment response between genders. We focus on two treatment arms: the combined ZDV and ZAL treatment (\( T = 1 \)) and ZDV monotherapy (\( T = 0 \)). The dataset comprises \(1{,}056\) patients with \(12\) covariates, including five continuous variables: age (years), weight (kg, denoted as wtkg), baseline CD4 count (cells/mm\(^3\)), Karnofsky score (\(0\)--\(100\) scale, denoted as karnof), and baseline CD8 count (cells/mm\(^3\)). All continuous variables are centered and scaled prior to analysis. The dataset also includes seven binary covariates: gender (\(1 =\) male, \(0 =\) female), homosexual activity (homo, \(1 =\) yes, \(0 =\) no), race (\(1 = \) nonwhite, \(0 =\) white), intravenous drug use history (drug, \(1 =\) yes, \(0 =\) no), symptomatic status (symptom, \(1 =\) symptomatic, \(0 = \) asymptomatic), antiretroviral experience (str2, \(1 =\) experienced, \(0 =\) naive), and hemophilia (hemo, \(1 =\) yes, \(0 =\) no).

\paragraph{National Supported Work (NSW) Demonstration}
The National Supported Work (NSW) Demonstration was a subsidized work program that ran for four years across \(15\) locations in the United States, providing participants with transitional work experience and assistance in securing regular employment. From April \(1975\) to August \(1977\), the NSW program operated as a randomized experiment in \(10\) locations, with some applicants randomly assigned to a control group that did not participate in the program. Data for \(6{,}616\) treatment and control observations were collected through retrospective baseline interviews and four follow-up interviews, covering a two-year period before randomization and up to \(36\) months afterward.

For our analysis, we use a randomized dataset from \citet{lalonde1986evaluating}, following the setup of \citet{smith2005does}. We combine randomized samples from 722 subjects (297 treated and 425 controls) with 2,490 control samples from the Panel Study of Income Dynamics (PSID) to create an observational dataset. The resulting dataset consists of 297 treated observations (\( T = 1 \)) and 2,915 control observations (\( T = 0 \)). This study includes 8 covariates: age, education level, ethnicity (represented as two variables), marital status, and educational attainment.

\subsection{Generating Small Randomized Outcomes and Large Observational Datasets}
\label{sec:new_data}
In line with the method used by \citet{kallus2018removing,hatt2022combining}, we generate a large observational dataset with confounding and a smaller unconfounded randomized dataset consisting solely of the outcomes, both derived from the real-world data described in Section~\ref{sec:original_data}. Importantly, the randomized dataset is drawn from a different population than the observational one, reflecting the limitations of randomized controlled trials (RCTs) in generalizing to the broader population of interest.

To do this, we follow the same procedure for the STAR, ACTG, and NSW datasets. First, we generate a small, unconfounded randomized dataset by sampling a small number of instances from the original RCT data ($128$, $50$, and $50$ instances for STAR, ACTG, and NSW, respectively). We introduce a distributional discrepancy between the randomized and observational datasets by selecting individuals for the randomized dataset based on a covariate (``birthday'' for STAR, ``gender'' for ACTG, and ``age'' for NSW); see \citet{hatt2022combining} for further details. Second, we create the observational dataset by introducing unobserved confounding, ensuring that the treatment and control groups differ systematically in their potential outcomes. Following \citet{kallus2018removing}, we select subjects from those who were not included in the randomized dataset: controls (\(T = 0\)) with especially low outcomes (i.e., \( y_i < \mathbb{E}[Y \mid T = 0] - c \cdot \sigma_{Y \mid T=0} \), where \( \sigma_{Y \mid T=0} \) is the standard deviation of the outcomes in the control group) and treated subjects (\(T = 1\)) with notably high outcomes (i.e., \( y_i > \mathbb{E}[Y \mid T = 1] + c \cdot \sigma_{Y \mid T=1} \), where \( \sigma_{Y \mid T=1} \) is the standard deviation of the outcomes in the treatment group).

The constant \(c\) is adjusted according to the size of the original dataset (with \(c = 1\) for STAR, \(c = 0\) for ACTG, and \(c = 0.25\) for NSW) to control the number of subjects in the observational dataset, ensuring that it remains large. This process introduces confounding by selectively including control subjects with lower outcomes and treated subjects with higher outcomes into the observational control and treatment groups, respectively. As a result, a naïve estimator relying solely on the observational data will be biased. Moreover, because this selection is based on the outcome variable itself, it cannot be controlled for using the observed covariates.
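
The selection rule above can be summarized in a single expression. The following is a sketch in notation introduced here for illustration: $\mathcal{I}$ denotes the indices of subjects not included in the randomized dataset, and $\bar{y}_t$ and $s_t$ denote the empirical mean and standard deviation of the outcomes in arm $t$, used as plug-in estimates of $\mathbb{E}[Y \mid T = t]$ and $\sigma_{Y \mid T=t}$. The indices retained in the observational dataset are then
\[
\mathcal{I}_o(c) = \left\{ i \in \mathcal{I} : t_i = 0,\; y_i < \bar{y}_0 - c\, s_0 \right\} \cup \left\{ i \in \mathcal{I} : t_i = 1,\; y_i > \bar{y}_1 + c\, s_1 \right\}.
\]
For example, with \(c = 0\) (the value used for ACTG), the rule keeps controls with below-average outcomes and treated subjects with above-average outcomes; larger values of \(c\) keep fewer subjects and make the selection more extreme.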

\section{Implementation Details}

\begin{algorithm}[t]
   \caption{Training Algorithm for Marginals and Projections Balancing (MB+PB)}
   \label{alg:mb_pb}
\begin{algorithmic}[1]
   \STATE {\bfseries Input:} 
   $D_{o} = \{(x_i, t_i, y_i)\}_{i=1}^{n_o}$, $D_r= \{D^0_r,D_r^1\}$ where $D^t_r=\{y^t_j\}^{n_r^{t}}_{j=1}$ for $t \in \{0,1\}$, initial and final weights $(\alpha_{s}, \alpha_{e})$, number of epochs $N_1$, balancing iterations $N_{b}$, neural networks for: pseudo-confounder generation ($\psi$), potential outcomes ($\mu$), marginals balancing ($\tilde{g}$), and projections balancing ($g$).
   \STATE {\bfseries Output:} Trained models $\mu$ and $\psi$.
   
   \STATE Initialize noise $\eta \sim \mathcal{N}\left(\mathbf{0}_l, \mathbf{I}_l\right)$ and generate $n_o$ samples $\{\eta_i\}_{i=1}^{n_o}$.
   
   \FOR{$\text{epoch} = 1$ to $N_{1}$}
       \STATE Increase $\alpha$ from $\alpha_{s}$ to $\alpha_{e}$.
       \STATE Generate pseudo-confounders $\tilde{u}_i = \psi(\eta_i)$ and estimate outcomes $\hat{y}_i = \mu_{t_i}(x_i,\tilde{u}_i)$ for all $1\leq i \leq n_o$.
       \STATE Compute factual loss:
       \[
       \mathcal{L}_f = \frac{1}{n_o} \sum_{i=1}^{n_o} \left(t_i \left(y_i - \hat{y}_i\right)^2 + (1-t_i)\left(y_i - \hat{y}_i\right)^2\right)
       \]
       \STATE Generate potential outcomes $\hat{y}^1_i = \mu_1(x_i, \tilde{u}_i)$ and $\hat{y}^0_i = \mu_0(x_i, \tilde{u}_i)$.
       \STATE Compute marginals balancing loss:
       \[
       \mathcal{L}_m = \left(\frac{1}{n^1_r} \sum_{i=1}^{n^1_r} \tilde{g}(y^1_i) - \frac{1}{n_o} \sum_{i=1}^{n_o} \tilde{g}(\hat{y}^1_i)\right)^2 
       + \left(\frac{1}{n^0_r} \sum_{i=1}^{n^0_r} \tilde{g}(y^0_i) - \frac{1}{n_o} \sum_{i=1}^{n_o} \tilde{g}(\hat{y}^0_i)\right)^2
       \]
       \STATE{Compute projections balancing loss:
       \[
       \mathcal{L}_p = \left(\frac{1}{n^1_r} \sum_{i=1}^{n^1_r} g(x_{\lambda(i)})y^1_i - \frac{1}{n_o} \sum_{i=1}^{n_o} g(x_i)\hat{y}^1_i\right)^2 
       + \left(\frac{1}{n^0_r} \sum_{i=1}^{n^0_r} g(x_{\lambda(i)})y^0_i - \frac{1}{n_o} \sum_{i=1}^{n_o} g(x_i)\hat{y}^0_i\right)^2
       \]
       where $\lambda(i)$ selects a random number between $1$ and $n_o$.}
       \STATE Compute total loss $\mathcal{L} = \mathcal{L}_f + \alpha (\mathcal{L}_m + \mathcal{L}_p)$
       \STATE Backpropagate to update $\mu$ and $\psi$ using Adam.
       
       \FOR{each balancing iteration $n = 1$ to $N_{b}$}
          \STATE Calculate the negative regularization loss:
          $
          \mathcal{L}_r = - (\mathcal{L}_m + \mathcal{L}_p)
          $
          \STATE Backpropagate to update $\tilde{g}$ and $g$ using Adam. 
       \ENDFOR
   \ENDFOR
   \STATE Return trained models $\{\mu_t\}_{t=0}^1$, and $\psi$.
\end{algorithmic}
\end{algorithm}
In this section, we provide the implementation details of our proposed algorithm, MB+PB. Specifically, we describe the neural network architectures used for the different modules and present detailed pseudo-code for the training procedure.

\paragraph{Neural Network Architectures.}  As detailed in Section~\ref{sec:algo}, MB+PB consists of four components: a generator $\psi(\eta)$, a CATE learner $\mu_t(X,\Tilde{U})$, a marginals balancing module $\tilde{g}$, and a projections balancing module $g$.

\begin{itemize}
    \item \textbf{Pseudo-Confounder Generator:} The generator $\psi(\eta)$ is a neural network designed to generate pseudo-confounders from the input variables, which consist of standard Gaussian noise. The network architecture consists of two fully connected layers with 16 hidden units and ELU activation functions.

    \item \textbf{CATE Learner:} The CATE learner is modeled as an S-Learner $\mu_t(X, \Tilde{U})$ and is implemented using a neural network with three fully connected layers. The first two layers have $32$ hidden units, each followed by an ELU activation function. The final layer outputs a scalar, representing the estimated potential outcome.

    \item \textbf{MB Module:} The marginals balancing module $\tilde{g}$ is modeled as a neural network with two hidden layers, each containing 8 hidden units. ReLU activation functions are applied to the hidden layers, and the output is constrained between $-1$ and $1$ or $0$ and $1$, using either a tanh or a sigmoid activation function, respectively.

    \item \textbf{PB Module:} The projections balancing module $g$ is also modeled as a neural network with two hidden layers, each containing 8 hidden units. ReLU activation functions are applied to the hidden layers, and the output is constrained between $-1$ and $1$ or $0$ and $1$, using either a tanh or a sigmoid activation function, respectively.  
\end{itemize}
We use the same neural network architectures for all of our results presented in the Experiments Section~\ref{sec:empirical_results}.

\paragraph{The Algorithm.} We present the full pseudo-code for MB+PB in Algorithm~\ref{alg:mb_pb}. The code consists of the training loop of the proposed model and the loss functions computation.
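
One way to read the alternating updates in Algorithm~\ref{alg:mb_pb} is as an approximate solver for a saddle-point problem. This formulation is an interpretation of the pseudo-code, written with the losses defined there, rather than an additional component of the method:
\[
\min_{\mu, \psi} \; \max_{\tilde{g},\, g} \; \mathcal{L}_f(\mu, \psi) + \alpha \left( \mathcal{L}_m(\mu, \psi, \tilde{g}) + \mathcal{L}_p(\mu, \psi, g) \right).
\]
The outer step updates $\mu$ and $\psi$ to decrease the total loss $\mathcal{L}$, while the inner balancing steps update $\tilde{g}$ and $g$ to increase $\mathcal{L}_m + \mathcal{L}_p$ by minimizing $\mathcal{L}_r = -(\mathcal{L}_m + \mathcal{L}_p)$, so that the two penalties approximate suprema over the test functions.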

\paragraph{Hyperparameters.} 
The regularization parameter $\alpha$ is set dynamically, following the heuristic described below. We start with a small value of $\alpha$ and, once the optimization of the observed factual loss stabilizes, gradually increase the weight of the regularization term. In all of our experiments, we train for $2000$ epochs. Specifically, we set $\alpha = 0.01$ for the first $1230$ epochs, then linearly increase $\alpha$ from $0.01$ to $100$ between epochs $1230$ and $1430$. From epoch $1430$ to $2000$, we train the model with the large regularization weight $\alpha = 100$. Additionally, as described in Algorithm~\ref{alg:mb_pb}, each epoch involves multiple balancing steps to train the MB and PB modules. To keep training efficient, we use a small number of balancing iterations ($5$) when $\alpha$ is small and increase this number to $50$ as $\alpha$ becomes large. We use the same training strategy across all datasets to avoid dataset-specific hyperparameter tuning and to obtain a more reliable assessment of the presented algorithm. All neural networks are trained with Adam using a learning rate of $0.001$. Finally, we use batch sizes of $256$, $200$, and $200$ for STAR, ACTG, and NSW, respectively.
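
The schedule for $\alpha$ can be written explicitly. The following is a sketch assuming that the epoch index $e$ runs from $1$ to $2000$ and that the linear ramp interpolates exactly between the two endpoint values; the notation $\alpha(e)$ is introduced here for illustration:
\[
\alpha(e) =
\begin{cases}
0.01, & e \leq 1230, \\[2pt]
0.01 + \dfrac{e - 1230}{200}\,\left(100 - 0.01\right), & 1230 < e < 1430, \\[6pt]
100, & e \geq 1430.
\end{cases}
\]
For instance, at the midpoint of the ramp, $e = 1330$, this gives $\alpha(1330) = 0.01 + 0.5 \times 99.99 = 50.005$.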


\paragraph{Computational Resources.}
The experiments in this paper are not computationally expensive and were run on an NVIDIA GeForce RTX 3090 GPU.




\section{Additional Results}
Here we include additional empirical results.

\subsection{Synthetic Example}
We begin by presenting additional results for the synthetic experiment discussed in the main text, following the approach of \citet{kallus2019interval}. In Figure~\ref{fig:synthetic-1d-epochs}, we report the $\sqrt{\varepsilon_{\PEHE}}$ as a function of training epochs. Additionally, the results for the factual loss across varying degrees of confounding are provided in Figure~\ref{fig:conf_factual}.

\begin{figure}[ht]
    \centering
    \begin{subfigure}{0.32\linewidth}
        \centering
        \includegraphics[width=\linewidth]{UAI2025/figures/kallus_1d_ln_gamma_1.png}
        \caption{Training $\sqrt{\varepsilon_{\PEHE}}$ for $\log(\Gamma) = 1.0$}
        \label{fig:pehe-loss_1}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.32\linewidth}
        \centering
        \includegraphics[width=\linewidth]{UAI2025/figures/kallus_1d_ln_gamma_3.png}
        \caption{Training $\sqrt{\varepsilon_{\PEHE}}$ for $\log(\Gamma) = 3.0$}
        \label{fig:pehe-loss_3}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.32\linewidth}
        \centering
        \includegraphics[width=\linewidth]{UAI2025/figures/kallus_1d_ln_gamma_5.png}
        \caption{Training $\sqrt{\varepsilon_{\PEHE}}$ for $\log(\Gamma) = 5.0$}
        \label{fig:pehe-loss5}
    \end{subfigure}
    \caption{Comparison of $\sqrt{\varepsilon_{\text{PEHE}}}$ across training epochs for different levels of confounding ($\log(\Gamma)$).}
    \label{fig:synthetic-1d-epochs}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.5\linewidth]{UAI2025/figures/factual_vs_confounding.png}
    \caption{Factual loss comparison across different degrees of confounding.}
    \label{fig:conf_factual}
\end{figure}

\subsection{Factual Loss Comparison Across Real-World Datasets}

\begin{table}[ht]
\centering
\caption{Comparison of the factual loss $\epsilon_{\F}$ (Mean $\pm$ Std) across three real-world datasets. Results are
presented for $10$ runs.}
\label{tab:real_factual}

\begin{tabular}{|l|c|c|c|}
\hline
 & \multicolumn{3}{c|}{$\epsilon_\F$ (Mean $\pm$ Std)} \\
\hline
\textbf{Estimator} & \textbf{STAR} & \textbf{ACTG} & \textbf{NSW} \\
\hline
Baseline & 1.3 $\pm$ 0.02 & 1.26 $\pm$ 0.05 & 0.38 $\pm$ 0.02 \\
MB+PB (Ours) & \textbf{1.08} $\pm$ 0.13 & \textbf{0.72} $\pm$ 0.03 & \textbf{0.17} $\pm$ 0.01 \\
\hline
\end{tabular}
\end{table}

Table~\ref{tab:real_factual} presents a comparison of the factual loss, $\epsilon_{\F}$, measured as the mean and standard deviation over 10 runs for three real-world datasets: STAR, ACTG, and NSW. Although the baseline model is designed to estimate the factual outcome, it may suffer from distributional shift, since the feature distribution of the test data differs from that of the training data. In that setting, learning a better causal model can also yield better factual estimates. We conjecture that the improved performance arises because our method learns a model that is more robust to distributional shifts, in line with the formalization of \citet{richens2024robust}.

The baseline estimator is compared against our method, MB+PB. MB+PB attains a lower mean factual loss on all three datasets, with the largest reductions on ACTG and NSW. This reduction indicates that our method aligns the model predictions more closely with the observed outcomes on the test distribution, which is consistent with mitigating the effects of confounding and improving the estimation of potential outcomes.

For the STAR dataset, our method achieves a mean factual loss of $1.08 \pm 0.13$, compared to $1.3 \pm 0.02$ for the baseline; this is the smallest improvement of the three and comes with a larger spread across runs. For the ACTG dataset, MB+PB reduces the mean loss from $1.26 \pm 0.05$ to $0.72 \pm 0.03$. Similarly, the NSW dataset shows a marked improvement, with a mean loss of $0.17 \pm 0.01$ for MB+PB compared to the baseline loss of $0.38 \pm 0.02$.

These results suggest that MB+PB is more robust than the naive factual learner across datasets, even in terms of factual loss, under the kind of distributional shift that is prevalent in real-world scenarios.
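
To make the comparison concrete, the relative reductions in mean factual loss can be computed directly from the means reported in Table~\ref{tab:real_factual}. This is arithmetic on the table entries, not an additional experiment, and it ignores the variability across runs:
\begin{align*}
\text{STAR:} \quad & \frac{1.3 - 1.08}{1.3} \approx 0.17, \\
\text{ACTG:} \quad & \frac{1.26 - 0.72}{1.26} \approx 0.43, \\
\text{NSW:} \quad & \frac{0.38 - 0.17}{0.38} \approx 0.55.
\end{align*}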

%\section{Schematic Illustration of the Proposed Approach}  


\end{document}


% \section{Introduction}\label{sec:intro}
% UAI 2025 papers have to be prepared using \LaTeX.
% To start writing your paper, copy \texttt{uai2025-template.tex} and replace title, authorship, and content with your own.

% The UAI 2025 paper style is based on a custom \textsf{uai2025} class.
% The class file sets the page geometry and visual style.\footnote{%
%     The class uses the packages \textsf{adjustbox}, \textsf{environ}, \textsf{letltxmacro}, \textsf{geometry}, \textsf{footmisc}, \textsf{caption}, \textsf{textcase}, \textsf{titlesec}, \textsf{titling}, \textsf{authblk}, \textsf{enumitem}, \textsf{microtype}, \textsf{lastpage}, and \textsf{kvoptions}.
% }
% The class file also loads basic text fonts.\footnote{%
%     Fonts loaded are \textsf{times} (roman), \textsf{helvet} (sanserif), \textsf{courier} (fixed-width), and \textsf{textcomp} (common symbols).
% }
% \emph{You may not modify the geometry or style in any way, for example, to squeeze out a little bit of extra space.}
% (Also do not use \verb|\vspace| for this.)
% Feel free to use convenience functionality of loaded packages such as \textsf{enumitem}.
% The class enables hyperlinking by loading the \textsf{hyperref} package.

% You are free to load any packages available in \TeX{Live}~2020 that are compatible with the UAI class.\footnote{In case this template or your submission does not compile, always first make sure your \TeX\ installation is up-to-date.}
% (Mik\TeX{} and Mac\TeX{} generally contain the same packages.)
% Do not load conflicting packages—you will get an error message—, as this complicates creating the proceedings.
% Please avoid using obsolete commands, such as \verb|\rm|, and obsolete packages, such as \textsf{epsfig}.\footnote{%
%     See \url{https://ctan.org/pkg/l2tabu}.
% }

% \swap[ ]{in the header of your source file.}{Feel free to include your own macros}

% \section{General Formatting Instructions}
% As a general rule: \emph{follow the template}.

% \subsection{Authorship}
% Reviewing is double-blind.
% However, you can already fill in your author names and affiliations in the \verb|\author| block in the preamble following the example of the template because the class will remove it as long as the option \textsf{accepted} is not passed to the class.
% Nevertheless, make sure any other information in the paper does not disclose your identity, for example URLs to supplementary material.

% \subsection{Sectioning}
% Three numbered sectioning commands are provided: \verb|\section|, \verb|\subsection|, and \verb|\subsubsection|.
% Please respect their order, so do not put a \verb|\subsubsection| directly beneath a \verb|\section|.
% One unnumbered sectioning command is provided, \verb|\paragraph|.
% It can be used directly below any numbered section level.
% Do not use any other sectioning commands.

% \subsubsection{Typing the Section Titles}
% The \verb|\section| and \verb|\subsection| titles are uppercased by the class.
% Please type them in title case.
% (This is used in the PDF bookmarks.)
% Please also write the \verb|\subsubsection| titles in title case.

% \paragraph{What is title case?}
% \href{https://en.wikipedia.org/wiki/Title_case}{Wikipedia} explains:
% \begin{quote}
%     Title case or headline case is a style of capitalization used for rendering the titles of published works or works of art in English.
%     When using title case, all words are capitalized except for ‘minor’ words (typically articles, short prepositions, and some conjunctions) unless they are the first or last word of the title.
% \end{quote}

% \subsection{References, Citations, Footnotes}\label{sec:etc}
% \subsubsection{Cross-Referencing}
% Always use \verb|\label| and \verb|\ref|—or a command with a similar effect—when cross-referencing.
% For example, this subsection is Section~\ref{sec:etc}.

% \subsubsection{Citations}
% Citations should include the author's last name and year.
% They should be part of the sentence.
% An example parenthetical citation: “Good introductions to the topic are available \citep{latexcompanion}.”
% An example textual citation: “\citet{einstein} discusses electrodynamics of moving bodies.”
% Do not use a parenthetical citation where a textual one is appropriate.
% An example of what \emph{not} to do: “\citep{einstein} discusses electrodynamics of moving bodies.”

% We strongly advise to use reference list software such as Bib\TeX{} and a citation package such as \textsf{natbib}.
% The reference style you use should be compatible with the author-year citations.
% Both the citation style and reference style used should be consistent.

% For the original submission, take care not to reveal the authors' identity through the manner in which one's own previous work is cited.
% For example, writing
% “I discussed electrodynamics of moving bodies before \citep{einstein}.” would be inappropriate, as it reveals the author's identity.
% Instead, write “\citet{einstein} discussed electrodynamics of moving bodies.”

% \subsubsection{Footnotes}
% You can include footnotes in your text.\footnote{
%     Use footnotes sparingly, as they can be distracting, having readers skip back and forth between the main text and the foot of the page.
% }
% The footnote mark should follow the fragment to which it refers, so a footnote\footnote{
%     A footnote is material put at the foot of a page.
% }
% for a word has a footnote mark attached to that word and a footnote for a phrase or sentence has a footnote mark attached to the closing punctuation.

% \section{Math}\label{sec:math}
% The class file does not load any math support package like \textsf{amsmath}\footnote{%
%   See the \textsf{amsmath} documentation at \url{https://ctan.org/pkg/amsmath} for further details.
% }.
% We advise using the \textsf{mathtools}\footnote{%
%   See the \textsf{mathtools} documentation at \url{https://ctan.org/pkg/mathtools} for further details.
% }
% package, which extends \textsf{amsmath} with fixes and even more useful commands.
% Feel free to load other support packages for symbols, theorems, etc.

% Use the \textsf{amsmath} environments for displayed equations.
% Specifically, use the \texttt{equation} environment instead of \verb|$$...$$| and the \texttt{align} environment instead of \texttt{eqnarray}.\footnote{For reasons why you should not use the obsolete \texttt{eqnarray} environment, see Lars Madsen, \textit{Avoid eqnarray!} TUGboat 33(1):21--25, 2012.}
% An \texttt{equation}:
% \begin{equation}\label{eq:example}
%   0 = 1 - 1.
% \end{equation}
% Two \texttt{align}'ed equations:
% \begin{align*} % no numbers with starred version
%   1 + 2 &= 3,\\
%   1 - 2 &= -1.
% \end{align*}
% Equations can also be put inline, of course.
% For example, Equation~\eqref{eq:example}: \(0=1-1\). % $0=1-1$ also works
% (Notice that both inline and displayed math are part of the sentence, so punctuation should be added to displayed math.)

% The \textsf{amsmath} and \textsf{mathtools} packages provide a lot of nice functionality, such as many common math operators, e.g., \(\sin\) and \(\max\), and also commands for defining new ones.

% \section{Floats}\label{sec:floats}
% Floats, such as figures, tables and algorithms, are moving objects and are supposed to float to the nearest convenient location.
% Please do not force them to go in the middle of a paragraph.
% They must respect the column width.

% Two-column floats are possible.
% They appear at the top of the next page, so strategic placement may be necessary.
% For an example, see Figure~\ref{fig:tikz}.
% They may not enter the margins.
% \begin{figure*}
%     \centering
%     \begin{tikzpicture}[xscale=1.5]
%         \coordinate (origin);
%         \draw[->] (origin) -- +(1cm,0) node[below] {$x$};
%         \draw[->] (origin) -- +(0,1cm) node[left] {$y$};
%         \fill[gray] (45:1cm) circle[radius=.2cm];
%     \end{tikzpicture}
%     \caption{A Nice Filled Ellipse with a Pair of Coordinate Axes.}\label{fig:tikz}
% \end{figure*}

% All material in floats should be legible and of good quality.
% So avoid very small or large text and pixelated or fuzzy lines.

% \subsection{Figures}\label{sec:figures}
% Figures should go in the \texttt{figure} environment and be centered therein.
% The caption should go below the figure.
% Use \verb|\includegraphics| for external graphics files but omit the file extension.
% Supported formats are \textsf{pdf} (preferred for vector drawings and diagrams), \textsf{png} (preferred for screenshots), and \textsf{jpeg} (preferred for photographs).
% Do not use \verb|\epsfig| or \verb|\psfig|.
% If you want to scale the image, it is better to use a fraction of the line width rather than an explicit length.
% For example, see Figure~\ref{fig:city}.

% \begin{figure}[!htb]
%   \centering
%   \includegraphics[width=\linewidth]{rio}
%   \caption{A View of a Nice City.}\label{fig:city}
% \end{figure}

% Do not use \verb|\graphicspath|.
% If the images are contained in a subdirectory, specify this when you include the image, for example \verb|\includegraphics{figures/mypic}|.

% \subsection{Tables}\label{sec:tables}
% Tables should go in the \texttt{table} environment and be centered therein.
% The caption should go above the table and be in title case.
% For an example, see Table~\ref{tab:data}.
% \begin{table}
%     \centering
%     \caption{An Interesting Table.}\label{tab:data}
%     \begin{tabular}{rl}
%       \toprule % from booktabs package
%       \bfseries Dataset & \bfseries Result\\
%       \midrule % from booktabs package
%       Data1 & 0.12345\\
%       Data2 & 0.67890\\
%       Data3 & 0.54321\\
%       Data4 & 0.09876\\
%       \bottomrule % from booktabs package
%     \end{tabular}
% \end{table}

% \subsection{Algorithms}\label{sec:algorithms}
% You can load your favorite algorithm package, such as \textsf{algorithm2e}\footnote{See the \textsf{algorithm2e} documentation at \url{https://ctan.org/pkg/algorithm2e}.}.
% Use the environment defined in the package to create a centered float with an algorithm inside.

% \section{Back Matter}
% There are some final, special sections that come at the back of the paper, in the following order:
% \begin{itemize}
%   \item Author Contributions (optional)
%   \item Acknowledgements (optional)
%   \item References
% \end{itemize}
% They all use an unnumbered \verb|\subsubsection|.

% For the first two, special environments are provided.
% (These sections are automatically removed for the anonymous submission version of your paper.)
% The third is the ‘References’ section.
% (See below.)

% (This ‘Back Matter’ section itself should not be included in your paper.)


% \begin{contributions} % will be removed in pdf for initial submission 
% 					  % (without ‘accepted’ option in \documentclass)
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions. 
%     This is a nice way of making clear who did what and to give proper credit.
%     This section is optional.

%     H.~Q.~Bovik conceived the idea and wrote the paper.
%     Coauthor One created the code.
%     Coauthor Two created the figures.
% \end{contributions}

% \begin{acknowledgements} % will be removed in pdf for initial submission,
% 						 % (without ‘accepted’ option in \documentclass)
%                          % so you can already fill it to test with the
%                          % ‘accepted’ class option
%     Briefly acknowledge people and organizations here.

%     \emph{All} acknowledgements go in this section.
% \end{acknowledgements}

% % References
% \bibliography{uai2025-template}

% \newpage

% \onecolumn

% \title{Title in Title Case\\(Supplementary Material)}
% \maketitle



% This Supplementary Material should be submitted together with the main paper.

% \appendix
% \section{Additional Simulation Results}
% Table~\ref{tab:supp-data} lists additional simulation results; see also \citet{einstein} for a comparison. 

% \begin{table}[!h]
%     \centering
%     \caption{An Interesting Table.} \label{tab:supp-data}
%     \begin{tabular}{rl}
%         \toprule % from booktabs package
%         \bfseries Dataset & \bfseries Result\\
%         \midrule % from booktabs package
%         Data1 & 0.12345\\
%         Data2 & 0.67890\\
%         Data3 & 0.54321\\
%         Data4 & 0.09876\\
%         \bottomrule % from booktabs package
%     \end{tabular}
% \end{table}

% \section{Math Font Exposition}
% % NOTE: necessary when ptmx or no mathfont class option is given
% \providecommand{\upGamma}{\Gamma}
% \providecommand{\uppi}{\pi}
% How math looks in equations is important:
% \begin{equation*}
%     F_{\alpha,\beta}^\eta(z) = \upGamma(\tfrac{3}{2}) \prod_{\ell=1}^\infty\eta \frac{z^\ell}{\ell} + \frac{1}{2\uppi}\int_{-\infty}^z\alpha \sum_{k=1}^\infty x^{\beta k}\mathrm{d}x.
% \end{equation*}
% However, one should not ignore how well math mixes with text:
% The frobble function \(f\) transforms zabbies \(z\) into yannies \(y\).
% It is a polynomial \(f(z)=\alpha z + \beta z^2\), where \(-n<\alpha<\beta/n\leq\gamma\), with \(\gamma\) a positive real number.


% \end{document}
