\section{Theory}\label{sec:theory}


\begin{figure}
    \centering
    \textsf{Data Generating Process}
    \includegraphics[width=0.9\linewidth]{figures/graph.pdf}
    \caption{Graphical representation of the data-generating process. Dashed arrows correspond to the reference policy for the offline data. Dynamics are condensed into outcome $Y$, action trajectory $\Pi$, and observation history $X$; see \S\ref{sec:hidden-confounding}. }
    \label{fig:graph}
\end{figure}



The main primitive underlying the data-generating process is a partially observable Markov decision process (POMDP). The POMDP consists of sequences of state $S_t\in\mathcal{S}$, action $A_t\in\mathcal{A}$, and observation $O_t$ random variables indexed by discrete time $t$. $O_t$ is the observable part of the full state $S_t$. There is also a reward $R_t\in\mathbb{R}_{\geq 0}$ that depends on the current state-action pair $(S_t,A_t)$. The evolution of the POMDP is governed by a transition kernel $\tilde T(S_{t+1}|S_t,A_t)$ that is assumed to be unknown. All that is observed at each time step is the triplet $W_t\triangleq (O_t,A_t,R_t)$; the full state is hidden, making the process \emph{partially observable}.

As in standard reinforcement learning, the goal of an agent is to choose actions that maximize expected future rewards over an infinite horizon with discount factor $\gamma\in(0,1)$. Without loss of generality, denote the present context as $t=0$. The agent chooses $A_0$ from the action set $\mathcal{A}$ using the current observation $O_0$ as well as any available past $(W_{-1},W_{-2},\ldots)$. The agent's objective is for repeated applications of its action policy to maximize $\E[ \sum_{t=0}^\infty \gamma^t R_t]$.
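
As a quick check that this objective is well defined, assume for illustration that rewards are bounded, $0\leq R_t\leq r_{\max}$; the constant $r_{\max}$ is introduced only for this example and is not part of the setup above. The geometric series then gives
\begin{equation*}
    0 \leq \sum_{t=0}^\infty \gamma^t R_t \leq r_{\max}\sum_{t=0}^\infty \gamma^t = \frac{r_{\max}}{1-\gamma},
\end{equation*}
and truncating the sum at a finite planning horizon $H$ discards at most
\begin{equation*}
    \sum_{t=H}^\infty \gamma^t R_t \leq \frac{\gamma^H\, r_{\max}}{1-\gamma}.
\end{equation*}
For instance, with $\gamma=0.9$ and $H=50$ the discarded tail is at most about $0.05\, r_{\max}$, which suggests that a finite-horizon plan can approximate the infinite-horizon objective closely under this boundedness assumption.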

Our setting has an offline and an online component. Data are collected offline under an unknown reference policy $\tilde\pi(A_t|S_t)$, and a predictive model is learned on the observables. We therefore assume access to samples from the conditional dynamics distributions ${P_{W_0,W_1,\cdots\mid W_{-1},W_{-2},\cdots}}$ that can be approximated arbitrarily well by a deep generative model. Any distribution $P$ involving observables $W_t$ is assumed to correspond to the offline data-generating process.

The agent must use $P$ to act online, replacing $\tilde\pi$ in the data-generating process with its own policy $\pi$ that aims to maximize the discounted reward in expectation. This is called \emph{off-policy learning} because data generated from the agent's own policy are not available while learning. The domain shift between the offline and online POMDPs cannot be anticipated before the agent starts acting. In particular, because the full state $S_t$ is unobserved, $P$ is not guaranteed to help produce optimal actions even though it is the exact conditional distribution of observables including actions and rewards.


\subsection{Hidden Confounding}\label{sec:hidden-confounding}
We simplify notation according to Figure~\ref{fig:graph} before proceeding with identification. The outcome of interest is the future discounted reward $Y\triangleq \sum_{t=0}^\infty \gamma^t R_t$. The agent makes plans on the basis of action trajectories taking the form $\Pi\triangleq [A_0\ A_1\ A_2\ \cdots]$ belonging to an $\mathcal{A}$-product space of finite or even infinite dimensionality, depending on the planning horizon.
The agent's context consists of the current and past observations together with past actions, $X\triangleq [O_0\ \ O_{-1} \ A_{-1}\ \ O_{-2} \ A_{-2}\ \cdots]$ \citep{littman01predictive}.

In the model predictive control (MPC) framework, the optimal controller selects the action trajectory $\Pi$ that maximizes the expected reward $Y$. This notation abstracts away the step-by-step dynamics of the (partially observed) Markov decision process. The optimal plan starting at a state $s\in\mathcal{S}$ is ultimately specified by
\begin{equation}\label{eq:optimal-plan}
    \pi^*\in\arg\max_{\pi\in\mathcal{T}} \E[Y\mid \Pi=\pi, S_0=s].
\end{equation}
$\mathcal{T}$ denotes the set of all feasible action trajectories, a subset of the product space $\mathcal{A}\times\mathcal{A}\times\cdots$. Since $S_0$ is not observed, it must be inferred from all of the available information in $X$. However, $X$ generally does not determine $S_0$ exactly, and the variation in $S_0$ left unexplained by $X$ manifests as residual (hidden) confounding.


\subsection{Potential Outcomes}
The naive approach to action-trajectory selection mirrors Equation~\eqref{eq:optimal-plan} but conditions on the observable context $X$ in place of the hidden state $S_0$:
\begin{equation}\label{eq:naive-plan}
    \pi^*_\text{naive}\in\arg\max_{\pi\in\mathcal{T}} \E[Y\mid \Pi=\pi, X=x]
\end{equation} %
Clearly, if $X$ cannot perfectly predict $S_0$, then the solutions to Equations~\eqref{eq:optimal-plan} and~\eqref{eq:naive-plan} may differ. A solution to Equation~\eqref{eq:naive-plan} might yield a high expected reward in the offline setting, but that is not guaranteed in the online setting, in which any confounding between $\Pi$ and $S_0$, conditioned on $X$, is removed. The online outcome $\E[Y\mid \Pi=\pi^*_\text{naive}, S_0=s_0]$ is not identifiable from the offline distribution $P$ because $S_0$ is never observed.
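
To make the gap between Equations~\eqref{eq:optimal-plan} and~\eqref{eq:naive-plan} concrete, consider a purely illustrative toy instance; every number below is a hypothetical choice made for exposition, not a result. Suppose that, given $X=x$, the hidden state is binary with $P(S_0=1\mid X=x)=\tfrac{1}{2}$, and that there are two candidate trajectories $\pi_a,\pi_b$ whose expected rewards given the state (and $X=x$) are
\begin{equation*}
    \E[Y\mid \Pi=\pi_a, S_0=1]=10,\qquad \E[Y\mid \Pi=\pi_a, S_0=0]=0,\qquad \E[Y\mid \Pi=\pi_b, S_0=s]=6 \ \text{ for } s\in\{0,1\}.
\end{equation*}
Suppose further that the reference policy favors $\pi_a$ when $S_0=1$, with $P(\Pi=\pi_a\mid S_0=1,X=x)=0.9$ and $P(\Pi=\pi_a\mid S_0=0,X=x)=0.1$. Bayes' rule gives
\begin{equation*}
    P(S_0=1\mid \Pi=\pi_a, X=x)=\frac{0.9\cdot 0.5}{0.9\cdot 0.5+0.1\cdot 0.5}=0.9,
\end{equation*}
so the offline conditional means are
\begin{equation*}
    \E[Y\mid \Pi=\pi_a, X=x]=0.9\cdot 10+0.1\cdot 0=9,\qquad \E[Y\mid \Pi=\pi_b, X=x]=6,
\end{equation*}
and Equation~\eqref{eq:naive-plan} selects $\pi_a$. An online agent that commits to $\pi_a$ without access to $S_0$, however, faces the state distribution given $X=x$ alone and obtains $0.5\cdot 10+0.5\cdot 0=5$ in expectation, whereas $\pi_b$ yields $6$. In this sketch the naive plan is therefore suboptimal online, solely because the offline choice of $\Pi$ carried information about $S_0$.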

We require a simple notation for the outcomes of a potential online intervention in an instance described by the observable $X$. The potential-outcomes framework \citep{rubin74,imbens15} provides such a theory, and can flexibly handle vector-valued interventions~\citep{marmarelis24_policy}.
\begin{definition}[Potential Outcome]\label{def:potential-outcome} %
    For every decision-making instance, the realized outcome $Y$ is the future reward from the offline dynamics, and the potential outcome $Y(\pi)$ associated with any action trajectory $\pi$ is the future reward that would be realized online from following actions $\pi$.
\end{definition} %
Potential outcomes and realized outcomes follow a joint distribution because each individual instance of dynamics is considered to have its own set of potential outcomes indexed by $\pi$.
For a particular instance of state and observable $(s_0,x)$, the \emph{marginal} behavior of one potential outcome can be expressed as
\begin{equation*}
    \big(Y(\pi) \mid X=x\big) \sim \big(Y \mid \Pi=\pi, S_0=s_0\big).
\end{equation*}

The most relevant insight is that, without conditioning on the full state $S_0$, the offline action trajectory $\Pi$ can itself carry statistical information about $S_0$. Hence, $Y|\Pi,X$ is not predictive of $Y(\pi)$ because the underlying $S_0$ is not fixed across different $(\Pi=\pi)$ conditions.
We formally define the counterfactual as follows.
\begin{definition}[Counterfactual]\label{def:counterfactual}
    Conditional expressions of the form $\ {Y(\pi)\mid \Pi,X}\ $ are called counterfactual because they describe ``what-if'' scenarios where offline $\Pi$ is observed, and we want to know the online outcome of a different $\pi$.
\end{definition}

\begin{definition}[Causal Estimand]\label{def:estimand}
    The quantity of interest for partial identification is $\E\big[Y(\pi)\mid X=x\big]$, to be evaluated at any $(\pi,x)$ in the support of $P_{\Pi,X}$.
\end{definition}
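
As a sketch of why this estimand is only partially identified, assume for illustration that $\Pi$ takes finitely many values and that the usual consistency condition $Y=Y(\Pi)$ holds; neither assumption is stated explicitly above. The law of total expectation over the offline trajectory then gives
\begin{equation*}
    \E\big[Y(\pi)\mid X=x\big] = P(\Pi=\pi\mid X=x)\,\E\big[Y\mid \Pi=\pi, X=x\big] + \sum_{\pi'\neq\pi} P(\Pi=\pi'\mid X=x)\,\E\big[Y(\pi)\mid \Pi=\pi', X=x\big].
\end{equation*}
The first term involves only the offline distribution $P$ of observables. Each expectation in the sum is a counterfactual in the sense of Definition~\ref{def:counterfactual}, since it concerns the outcome of $\pi$ in instances where a different $\pi'$ was observed offline, and it is not determined by $P$ without further assumptions on the hidden confounding.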



