\section{Introduction}\label{sec:intro}





Learning to \emph{predict} dynamics that are partially observed may be unhelpful for \emph{taking action} in those dynamics, especially if the hidden state confounds the relationship between action and outcome. We consider the problem of using a predictive model trained on offline trajectory data for the purpose of online control. We assume that partial observability induces hidden confounding in the offline data-generating process.

Our insights center on the \emph{identification} of dynamics under intervention, such as those induced by an action policy in online (closed-loop) control.
We study the setting in which we only have access to a confounded predictive model that can generate samples of the dynamics, following the offline distribution of observables generated by an unknown policy acting on the hidden state.
By projecting entire trajectories of possible actions into the future, a controller may find the trajectory with the best predicted outcome, and act on it. With hidden confounding, the controller needs to assess the worst-case outcome by taking into account any known constraints on the hidden confounding. This makes the action policy more conservative. If the worst case represents a truly valid instantiation of possible hidden confounding, then the controller performs as well as possible and is considered minimax optimal.



\begin{table}
    \centering
    \begin{tabular}{r| l l l}
        & Easy & Medium & Hard \\
        \midrule
        Ours & \textbf{20.8}\% & \textbf{17.9}\% & \textbf{22.8}\%  \\
        MSM & 15.3\% & 13.0\% & 21.6\% \\
        Empirical & 13.1\% & 15.9\% & 20.6\% \\
    \end{tabular}
    \caption{Results of partially identified controllers, expressed as average improvement in reward over naive model predictive control (MPC), for $2^8=256$ i.i.d.\ experiments in each column. Standard errors were all about 1\%.} %
    \label{tab:benchmark}
\end{table} %




\paragraph{Contributions.}
We propose a continuity constraint on counterfactual probabilities that admits an adaptive method for partially identifying the outcomes of action trajectories. This is used as a \emph{sensitivity model} for the hidden confounding by setting a single parameter $\Gamma\geq 0$, associated with a norm over action trajectories, that quantifies the extent of hidden confounding and can be calibrated online (Definition~\ref{def:sensitivity-model}).
We formally characterize sharp bounds for the partially identified outcome of an action trajectory (Lemma~\ref{lem:sharpness}). The sharp lower bound naturally gives rise to a minimax model-predictive controller (Lemma~\ref{lem:control}). We implement such a controller by augmenting a practical algorithm that is commonly used in deep reinforcement learning (Algorithm~\ref{alg:mppi}). Finally, we show empirically that this algorithm yields higher average rewards than alternative methods across a diverse set of linear and nonlinear synthetic experiments (Table~\ref{tab:benchmark}).







\section{Background}
The main question we want to answer is how to design a controller using an environment's \emph{observable} behavior from actions taken under an unknown reference policy.
Our focus is on the challenges that emerge when those actions interact with an underlying state that is not observed.
This is common; examples include robotics with limited sensing and text-based agents interacting with humans~\citep{lang24}.

The field of artificial intelligence is beginning to realize the promise in deploying large (self-)supervised predictive models as \emph{agents} to interact with the world~\citep{acharya2025agentic,wang2024voyager}.
This agentic viewpoint can be cast as a predictive control problem, since the agent must learn to act from observations of the environment in order to achieve diverse goals.
We emphasize that without abundant feedback from the environment, and without full observability of the relevant state of the world, even the most capable predictive models could fail dramatically when planning actions to achieve a real-world goal~\citep{saghafian2024ambiguous}.
When a model's predictions do not translate to interventions, the problem is one of \emph{identifiability}~\citep{peters17}. %

This paper considers one particular obstacle to identifiability: that of hidden confounding.
Hidden confounding can easily manifest within the agentic paradigm because
foundation models are seldom trained directly on the tasks that an agent would seek to accomplish. While reinforcement learning (RL) is employed to improve alignment or reasoning capabilities~\citep{guo2025deepseek}, the training process does not collect new data from the world, so the foundation model does not explore or learn from its own actions as a hypothetical agent. Interventional data are much more costly to obtain than the observational datasets that enable foundation models.




\begin{figure}
    \centering
    \textsf{Model Predictive Control as Causal Inference}
    \includegraphics[width=0.75\linewidth]{figures/intro.pdf}
    \caption{When planning to take actions in an online (interventional) setting, a dynamics model trained offline (on observational data) can usually be trusted more for action trajectories that remain near the reference policy. Trajectories are ``off-policy'' when they are generated by a newly learned policy, and the model's predictions for them can be subject to hidden confounding.}
    \label{fig:intro}
\end{figure}



\paragraph{Motivating example.}
To motivate the problem setting, we consider the market impact~\citep{gueant2016financial} of trading agents in the financial sector~\citep{bai2025review}.
A trader may wish to rebalance a portfolio optimally by executing specific trades.
However, the trader has not observed the full dynamics of how market participants reacted to past trades.
As the trader interacts with the market, additional hidden factors may influence the evolution of the price.
The trader may therefore want to execute trades according to an upper or lower bound on the expected price impact under the hidden confounding.
The controller we propose is designed to achieve this goal.

\subsection{Model Predictive Control}
A \emph{world model} that can predict the dynamics of actions and future states can readily implement agents through \emph{model predictive control} (MPC)~\citep{clarke1987generalized}, a widely celebrated family of algorithms for adaptive control~\citep{fernandez1995model} that project entire trajectories of states and actions into the future, select the best one, and execute the first action in that trajectory. MPC in model-based RL enables fast learning~\citep{lale2021model,lale2024falcon}, as well as generalizable and multi-task agents~\citep{hansen2024tdmpc,hu2023planning}. Moreover, world models trained online can be used offline to learn agents for novel tasks~\citep{georgiev2024pwm,hafner2023mastering}.
A trend is emerging toward offline-trained world models in realms that were traditionally suited for online RL~\citep{lecun2022path,ajay2023is}, likely due to data accessibility and the demonstrated scalability of self-supervised learning~\citep{chen2020big}. %
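
As a minimal sketch of a single MPC step, written in illustrative notation that is assumed here and need not match the formal development later in the paper, let $\hat{p}$ denote the world model, $r$ a reward function, and $H$ a planning horizon. At time $t$ in state $s_t$, the controller solves
\[
    a^\star_{t:t+H-1} \in \arg\max_{a_{t:t+H-1}} \; \mathbb{E}_{\hat{p}}\!\left[\, \sum_{k=0}^{H-1} r(s_{t+k}, a_{t+k}) \,\middle|\, s_t,\, a_{t:t+H-1} \right],
\]
executes only the first action $a^\star_t$, observes the next state, and replans from there.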

The lack of identifiability in a partially observed system affects MPC as well, specifically when an offline-trained world model is used for novel tasks. Our goal is to provide an approach for \emph{partially identifying} the outcomes of an agent's actions while leveraging a world model's predictions. To do so, it is necessary to assume a structural constraint on the impact of, or \emph{sensitivity} to, hidden confounding, which here takes the form of continuity in the action space. We propose theoretically guaranteed conservative MPC under the worst-case scenarios admitted by the partial identification. This work is similar in spirit to, but orthogonal to, ``offline RL'' with hidden confounding; we elaborate below.
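
One way to picture the conservative objective, offered as a sketch in assumed notation and not as the formal statement of our results: for a current state $s_t$ and a planning horizon $H$, let $\mathcal{P}_\Gamma$ denote a hypothetical set of interventional dynamics that agree with the offline world model on observables and satisfy the sensitivity constraint at level $\Gamma$, and let $R$ denote the cumulative reward of a trajectory. The controller then plans against the least favorable member of that set,
\[
    \max_{a_{t:t+H-1}} \; \min_{p \in \mathcal{P}_\Gamma} \; \mathbb{E}_{p}\!\left[ R \,\middle|\, s_t,\, a_{t:t+H-1} \right].
\]
If $\mathcal{P}_\Gamma$ contains only the offline model itself, the inner minimization is vacuous and the objective reduces to standard MPC.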


\subsection{Offline Reinforcement Learning}
Offline RL refers to the class of methods for learning action policies from data collected under a reference policy that cannot be updated, and that is not from an expert---i.e., does not maximize rewards for the task of interest. Most offline RL algorithms borrow from online RL with the addition of regularization to protect against domain shift~\citep[e.g.][]{kumar2020}. %
They tend to involve learning a state-action value \emph{Q-function} for the current action policy and iteratively optimizing a new policy on the basis of a Bellman equation. These approaches can be efficient and robust~\citep{panaganti22}. Sensitivity to hidden confounding has also been incorporated through structural constraints~\citep{bennett2024efficient}, latent variables~\citep{pace2024delphic}, as well as adjustment through auxiliary variables~\citep{wang2025offpolicy}. %

Our focus on MPC diverges from those lines of work. The scope of this paper assumes access to an accurate (offline, confounded) world model, with the task of using it for online control. This regime is becoming relevant to real-world problems with the emergence of foundation models, yet also contrasts with classical control theory by allowing the dynamics---crucially, of the hidden confounders---to largely remain a black box. The constraint on the hidden confounders is meant to be adaptive to most data-generating processes. %


\subsection{Causal Inference}

Our main insight is that recent theoretical tools from the intersection of causal inference and machine learning can be deployed to this context of partially identified predictive control. Causal inference is primarily concerned with the identification and estimation of causal relationships among variables. Many have studied the necessary and sufficient conditions for identifying one variable's outcome from another variable's intervention~\citep{imbens15}. In the presence of hidden confounding, researchers have developed \emph{sensitivity models} that impose structural constraints on the confounders, and yield tractable bounds for the \emph{causal estimand}.
Hidden confounders are distinct from latent confounders, which can be inferred to some extent. In general, data cannot carry information about hidden confounders, so structural constraints are used instead to quantify a model's ignorance.
Sensitivity models have a long history of improving the robustness of observational studies~\citep{cornfield59} and are making their way into machine-learning pipelines for the sciences~\citep{feuerriegel2024causal,haddad23}.

The push to make these methods useful in machine learning has led to more general sensitivity models: the univariate binary or discrete-intervention setting~\citep{tan} has quickly evolved to continuous~\citep{jesson22,marmarelis23} and even multivariate~\citep{frauen2024sharp} interventions.
Starting with \citet{dorn22}, progress has also been made in formally characterizing the sharpness of the bounds arising from these sensitivity models.
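
For concreteness, a widely used sensitivity model for a binary treatment is the marginal sensitivity model, stated here in its usual form with notation assumed purely for illustration. With covariates $x$, a hidden confounder $u$, and treatment $A \in \{0,1\}$, write $e(x) = P(A=1 \mid x)$ for the nominal propensity and $e(x,u) = P(A=1 \mid x, u)$ for the complete propensity. For a parameter $\Lambda \geq 1$, the model posits
\[
    \Lambda^{-1} \;\leq\; \frac{e(x,u)\,/\,\bigl(1-e(x,u)\bigr)}{e(x)\,/\,\bigl(1-e(x)\bigr)} \;\leq\; \Lambda,
\]
so that $\Lambda = 1$ recovers the unconfounded case and larger values of $\Lambda$ admit stronger hidden confounding.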

Considering MPC as the problem of identifying outcomes associated with \emph{entire future trajectories} of actions, a sufficiently flexible sensitivity model should yield conservative policies in the presence of hidden confounding. Figure~\ref{fig:intro} illustrates the link between off-policy trajectories and interventions. We build on recent progress and present our analysis in the framework of \emph{potential outcomes} introduced by \citet{neyman23}, which vastly simplifies notation and centers on identifiability. %


