\section{Experimental Details}\label{app:experiments}
Source code for all experiments is included in the supplementary material. Key details are listed below.

\paragraph{MPC instantiation.}
The partially identified MPPI algorithm described in Algorithm~\ref{alg:mppi} was run with the same hyperparameters in every setting: each search iteration sampled $512$ action trajectories, drew $64$ reward trajectories for each action trajectory, and retained the top $32$ action trajectories for updating the policy estimate. We used five search iterations. Because MPPI computes its weights with a softmax over the rewards, we first normalized the rewards by dividing them by their mean across action trajectories and then applied a temperature of $0.01$. We used a horizon of $16$ with undiscounted rewards as a practical substitute for the ideal infinite-horizon, discounted setting.
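
As an illustration of the weighting step, suppose each action trajectory $i$ in an index set $\mathcal{I}$ has already been assigned a scalar reward $R_i$. The symbols $R_i$, $\tilde{R}_i$, $w_i$, $\tau$, and $\mathcal{I}$ are notation introduced only for this sketch; how the sampled reward trajectories are aggregated into $R_i$, and whether $\mathcal{I}$ ranges over all sampled trajectories or only the retained ones, follows Algorithm~\ref{alg:mppi} and the source code and is not restated here. Under these assumptions, the normalization and softmax can be written as
\[
  \tilde{R}_i = \frac{R_i}{\frac{1}{|\mathcal{I}|}\sum_{j \in \mathcal{I}} R_j},
  \qquad
  w_i = \frac{\exp\bigl(\tilde{R}_i / \tau\bigr)}{\sum_{k \in \mathcal{I}} \exp\bigl(\tilde{R}_k / \tau\bigr)},
  \qquad
  \tau = 0.01 .
\]
Dividing by the mean makes the temperature scale-free. Assuming a positive mean reward, a trajectory whose reward exceeds the mean by $1\%$ has $\tilde{R}_i = 1.01$ and therefore receives $\exp\bigl((1.01 - 1)/0.01\bigr) = e \approx 2.72$ times the weight of a trajectory with exactly the mean reward.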

\paragraph{Simulation.} %
The simulations for the easy, medium, and hard settings described in Table~\ref{tab:experimental-settings} all used a simulation time step of $\dd t = 0.1$ and a noise scale of $\sigma=0.1$, though learning and control were performed at unit time intervals. The mixing matrix was kept relatively sparse in order to induce a variety of dependencies among the variables in the SDE. Specifically, an entry in the mixing matrix was nonzero with probability $2/n$, where $n$ was the dimensionality of the whole SDE, including the hidden variables. Nonzero entries were drawn from a standard normal distribution. The dimensions of the stochastic process were then reordered such that the first dimension ``received'' the most influence from the other dimensions and the second dimension ``gave'' the most influence to the others. This arrangement made it more feasible for the second dimension to control the first, as the control problem was posed. Finally, through rejection sampling on candidate mixing matrices, we ensured a high degree of hidden confounding. We estimated the Pearson correlation coefficients between \emph{a)} the hidden dimensions and the action dimension's future, and \emph{b)} the hidden dimensions and the reward dimension's future. We required the geometric mean of \emph{(a)} and \emph{(b)} to be greater than $0.33$, a threshold that rejected the majority of candidate processes. For the hard setting, we lowered that threshold to $0.20$ because processes that passed the stricter threshold were difficult to find.
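
For concreteness, the sampling of a candidate mixing matrix and the acceptance test can be summarized as follows. The symbols $M$, $B_{ij}$, $Z_{ij}$, $\rho_a$, and $\rho_b$ are notation introduced only for this sketch, and the entries are assumed to be drawn independently:
\[
  M_{ij} = B_{ij} Z_{ij},
  \qquad
  B_{ij} \sim \mathrm{Bernoulli}(2/n),
  \qquad
  Z_{ij} \sim \mathcal{N}(0, 1),
\]
so that each row of $M$ has $n \cdot (2/n) = 2$ nonzero entries in expectation, regardless of $n$. Writing $\rho_a$ and $\rho_b$ for the estimated correlations in \emph{(a)} and \emph{(b)}, assumed here to be nonnegative (for instance, taken in absolute value), a candidate is accepted when
\[
  \sqrt{\rho_a \, \rho_b} > 0.33 ,
\]
with $0.33$ replaced by $0.20$ in the hard setting. For example, $\rho_a = 0.5$ and $\rho_b = 0.25$ give $\sqrt{0.125} \approx 0.354$, which passes either threshold. How the correlations are aggregated when there are several hidden dimensions is not specified in this summary and is left to the source code.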

\paragraph{Estimation.}
For the easy setting we fit a linear model, while for the medium and hard settings we used a neural network (a multilayer perceptron) with two hidden layers of size $256$ and SELU activations. The neural networks had access to the past four time points when predicting the noise-free drift term of the next time point.
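
Written out, and using notation introduced only for this sketch, a predictor with a four-step history takes the form
\[
  \hat{\mu}_{t+1} = f_\theta\bigl(x_{t-3}, x_{t-2}, x_{t-1}, x_t\bigr),
\]
where $x_s$ denotes the variables available to the model at time point $s$ and $\hat{\mu}_{t+1}$ is the predicted noise-free drift for the next time point. For the multilayer perceptron, and assuming a linear output layer (an assumption of this sketch, not a detail stated above), the map applied to the concatenated history $z$ is
\[
  f_\theta(z) = W_3 \, \mathrm{SELU}\bigl(W_2 \, \mathrm{SELU}(W_1 z + b_1) + b_2\bigr) + b_3 ,
\]
with both hidden layers of width $256$.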

\paragraph{Calibration.}
The grids searched to select the best sensitivity parameter for our method and for the MSM and empirical baselines were tuned for efficiency and balance. Figure~\ref{fig:calibration-histograms} shows how often each grid value was selected under each approach.

\begin{figure}
    \centering
    \scalebox{0.8}{
      \input{figures/calibration-histograms.pgf}}
    \caption{Histograms of how often each value on the calibration grid for $\log\Gamma$ attained the top reward, for each experimental setting. Because the sensitivity parameters of our formulation and of the MSM are generally incommensurable, we verified empirically that their respective grids were balanced and that their selection frequencies overlapped.}
    \label{fig:calibration-histograms}
\end{figure}
