
\section{Experiments}\label{sec:eval} %
To benchmark the core contributions of this paper fairly, we compared variants of MPPI that share the same dynamics models and the same hyper-parameters, the latter tuned for the baseline algorithm that uses no partial identification. We refer to MPPI without partial identification as ``naive'', since it is unaware of hidden confounding. Algorithm~\ref{alg:mppi} shows MPPI augmented with our proposed sensitivity analysis. Similarly, MPPI can be augmented with the sensitivity model studied in numerous recent works~\citep{frauen2024sharp,kausik2024offline,bennett2024efficient}, which is inspired by the classic MSM. In the simplest terms, this sensitivity model constrains the divergence of the counterfactuals \emph{uniformly}, rather than based on a norm in the intervention space. We refer to this baseline as ``MSM''.
To highlight the difference in the kind of uncertainty under consideration, and to represent an approach from distributional RL~\citep{bellemare2017distributional}, we present an additional baseline that takes lower conditional outcome quantiles in place of a causal sensitivity analysis. This baseline uses empirical uncertainties to emulate confounding uncertainty, so we term it ``empirical''.


\begin{table}
    \centering
    \begin{tabular}{r| l l l}
        Setting & Observed & Hidden & Nonlinearity \\
        \midrule
        Easy & 4 & 1 & None  \\
        Medium & 8 & 8 & Sigmoid \\
        Hard & 16 & 16 & Cubic \\
    \end{tabular}
    \caption{Numbers of observed and hidden dimensions, as well as the type of nonlinearity, selected for the experimental settings with results in Table~\ref{tab:benchmark} and Figure~\ref{fig:results-scatter}.}
    \label{tab:experimental-settings}
\end{table}


The goal is to assess the viability of these three approaches for online calibration of an offline-trained controller with hidden confounding. Each of the benchmarked methods has a single sensitivity parameter: $\Gamma$ for ours and the MSM, and the quantile level for the empirical baseline. We evaluated grids of these sensitivity parameters for each experiment, while ensuring that they overlapped as closely as possible in terms of relative performance. We then identified the best-performing sensitivity value for each method and compared its total reward against that of the naive controller. The resulting improvements are non-negative: the search grid includes the option of no calibration, so the naive controller's total reward acts as a lower bound. An example calibration curve is shown in Figure~\ref{fig:gamma-curve} to illustrate how reward tends to increase, saturate, and then decrease with $\Gamma$.
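As a sketch of this selection step, in notation introduced here only for illustration, let $\mathcal{G}$ denote the search grid of a method, $J(\gamma)$ the total reward obtained with sensitivity value $\gamma$, and $\gamma_0 \in \mathcal{G}$ the value that corresponds to no calibration, so that $J(\gamma_0)$ is the naive controller's total reward. The unnormalized improvement is then
\[
  \Delta = \max_{\gamma \in \mathcal{G}} J(\gamma) - J(\gamma_0) \;\geq\; 0,
\]
where the inequality holds because $\gamma_0$ is itself a candidate in the maximization. Any normalization applied to obtain relative improvements is not captured by this sketch.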




\begin{figure}
    \centering
    \scalebox{0.8}{
      \input{figures/gamma-experiment0.pgf}}\vspace{-1em}
    \caption{Reward improvement scores for the first ``easy'' experiment with our sensitivity analysis, as a function of $\log\Gamma$. This plot exhibits the trade-offs for increasing sensitivity. A low $\Gamma$ encourages more, potentially careless action, whereas a high $\Gamma$ could make the controller too conservative. }
    \label{fig:gamma-curve}
\end{figure}





For maximal generality in the simulations, we sampled multivariate stochastic differential equations (SDEs) of the Ornstein--Uhlenbeck (OU) process form~\citep{karatzas}, with varying dimensionality and degree of nonlinearity. These processes had completely random structure and were filtered for stability and significant confounding. We tested three distinct settings (``easy'', ``medium'', and ``hard'') with 256 independent experiments each. For the easy setting we trained simple linear SDE models, whereas for the medium and hard settings we trained neural models with longer windows into the past. For all settings, the controller's task was to minimize the squared value of the first dimension by controlling the second dimension.
Concretely, the SDEs took the form
$\dd S_t = -h(A S_t)\dd t + \sigma\dd W_t$,
where $S_t$ is the full state vector, $A$ is a mixing matrix, $W_t$ is a Wiener process, and $h(\cdot)$ is the optional nonlinearity, given as the gradient of a convex function.
The reward is $R_t = -\big( S_t^{[0]} \big)^2$, where the superscript $\cdot^{[i]}$ denotes the $i^\text{th}$ component of $S_t$.
The observed component is
$\dd O_t = \dd S_t^{[0:k]}$,
where $k$ is the number of observed dimensions.
The control acts on $S_t^{[1]}$: the action $a_t$ is applied to $O_t^{[1]}$ at each time step.
Different realizations of the processes were simulated in discrete time through the Euler--Maruyama scheme.
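As a sketch of this discretization, assuming a fixed step size $\Delta t$ (a symbol introduced here for illustration, not a value taken from the experiments), one Euler--Maruyama step for the uncontrolled SDE reads
\[
  S_{t+\Delta t} = S_t - h(A S_t)\,\Delta t + \sigma\sqrt{\Delta t}\,\xi_t,
  \qquad \xi_t \sim \mathcal{N}(0, I),
\]
where the $\xi_t$ are independent standard Gaussian vectors that stand in for the Wiener increments $W_{t+\Delta t} - W_t$. How the action $a_t$ enters this update is omitted from the sketch.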
Additional details are available in \S\ref{app:experiments}.

\begin{figure}
    \centering
    \scalebox{0.8}{
      \input{figures/experiments-scatter.pgf}}\vspace{-1em}
    \caption{Pairwise comparison of the results in Table~\ref{tab:benchmark} between our sensitivity model and the MSM baseline. We display improvements in reward over the naive controller for each of the 256 experiments. A larger share of points to the right of the diagonal suggests that our model performed better.}
    \label{fig:results-scatter}
\end{figure}




Results are mainly displayed as relative improvements in reward over the naive controller. Table~\ref{tab:benchmark} shows average improvements for our method compared with the baselines, across the three experimental settings described in Table~\ref{tab:experimental-settings}. Improvements per experiment are plotted for our method versus the MSM in Figure~\ref{fig:results-scatter}. In aggregate, our method appears to yield an increase in reward of 20\% or more over the MSM.

