\section{Challenges of Causal Inconsistency}\label{sec:_2}
We focus on sequential decision-making in a Markov Decision Process (MDP, \citet{puterman1994markov}), where the agent intervenes on a sequence of actions to optimize certain rewards (primary outcomes). The standard MDP formalism takes the perspective of a learner who can actively intervene in the environment. Data collected from such randomized experiments are free of unobserved confounding bias, which the model generally assumes away. However, when the offline data are collected by passive observation, the learner does not necessarily have deliberate control over the behavioral policy generating the data. This can lead to confounding bias in various decision-making tasks \citep{kallus2018confounding,zhang2020causal,kumor2021causal,guo2022provably,ruan2024causal}. In this paper, we consider an extended family of MDPs that explicitly models the presence of unobserved confounders.
\begin{definition}
\label{def:cmdp}
    A Confounded Markov Decision Process (CMDP) $\mathcal{M}$ is a tuple of $\langle \1S, \1X, \1Y, \1U, \1F, P \rangle$ where (1) $\1S, \1X, \1Y$ are, respectively, the space of observed states, actions, and rewards; (2) $\1U$ is the space of unobserved exogenous noise; (3) $\1F$ is a set consisting of the transition function $f_S: \1S \times \1X \times \1U \mapsto \1S$, behavioral policy $f_X: \1S \times \1U \mapsto \1X$, and reward function $f_Y: \1S \times \1X \times \1U \mapsto \1Y$; (4) $P$ is an exogenous distribution over the domain $\1U$.
\end{definition}
Throughout this paper, we will consistently assume the state-action domain $\1X \times \1S$ to be finite; the reward domain $\1Y$ is bounded in a real interval $[a, b] \subset \3R$. 

Consider a demonstrator agent interacting with a CMDP $\1M$ and generating the offline data. At every time step $t = 1, \dots, T$, nature first draws an exogenous noise $U_t$ from the distribution $P(\1U)$; the demonstrator then performs an action $X_t \gets f_X(S_t, U_t)$, receives a subsequent reward $Y_t \gets f_Y(S_t, X_t, U_t)$, and moves to the next state $S_{t + 1} \gets f_S(S_t, X_t, U_t)$. The observed trajectories of the demonstrator (from the learner's perspective) are thus summarized by the observational distribution $P(\bar{\*X}_{1:T}, \bar{\*S}_{1:T}, \bar{\*Y}_{1:T})$, i.e.,
\begin{align*}
    P(\bar{\*x}_{1:T}, \bar{\*s}_{1:T}, \bar{\*y}_{1:T}) = P(s_1) \prod_{t=1}^T \bigg ( \int_{\1U}  \I_{s_{t+1} = f_S(s_t, x_t, u_t)}&\\
    \I_{x_t = f_X(s_t, u_t)}\I_{y_t = f_Y(s_t, x_t, u_t)} P(u_t) \bigg )&
\end{align*}
\Cref{fig:_2_1_mdp} shows the causal diagram $\1G$ \citep{bareinboim2022pearl} describing the process that generates the offline data in CMDPs. More specifically, solid nodes represent the observed variables $X_t, S_t, Y_t$, and arrows represent the functional relationships $f_X, f_S, f_Y$ among them. By convention, exogenous variables $U_t$ are not shown explicitly; instead, bi-directed arrows $X_t \leftarrow \rightarrow Y_t$ and $X_t \leftarrow \rightarrow S_{t+1}$ indicate the presence of an unobserved confounder (UC) $U_t$ affecting the action, state, and reward simultaneously. These bi-directed arrows (highlighted in \textcolor{blue}{blue}) characterize the spurious correlations among action $X_t$, reward $Y_t$, and state $S_{t+1}$ in the offline data, violating the condition of no unmeasured confounding (NUC, \citet{robbins1985some}). Such violations could lead to challenges in off-policy evaluation, which we discuss in the remainder of this section.

%Examining the causal diagram of \Cref{fig:_2_1_mdp} reveals properties of the observational data generated in CMDPs. First, the Markov property holds \citep{puterman1994markov}. That is, for every time step $t > 1$, the current state $S_t$ ``block'' all pathways from previous nodes (e.g., $S_{t-1}$) to the future nodes (e.g., $S_{t+1}$); the definition of blockage follows \citep[Def.~1.2.3]{pearl:2k}. This means that the state action history $\bar{\*X}_{1:t-1}, \bar{\*S}_{1:t-1}$ are independent of the future outcomes $\bar{\*X}_{t:T}, \bar{\*S}_{t+1:T}, \bar{\*Y}_{t+1:T}$ given the current state $S_t$. Second, 

\paragraph{Off-Policy Evaluation.} A policy $\pi$ in a CMDP $\1M$ is a decision rule $\pi: \1S \mapsto \1X$ mapping from state to action. Similarly, $\pi(x_t \mid s_t)$ is a stochastic policy mapping from state space $\1S$ to a distribution over action space $\1X$. An intervention $\doo(\pi)$ is an operation that replaces the behavioral policy $f_X$ in CMDP $\1M$ with the policy $\pi$. Let $\1M_{\pi}$ be the submodel induced by intervention $\doo(\pi)$. The interventional distribution $P_{\pi}(\bar{\*X}_{1:T}, \bar{\*S}_{1:T}, \bar{\*Y}_{1:T})$ is defined as the joint distribution over observed variables in $\1M_{\pi}$, i.e.,
\begin{equation}
    \begin{split}
        P_{\pi}(\bar{\*x}_{1:T}, \bar{\*s}_{1:T}, \bar{\*y}_{1:T}) = P(s_1) \prod_{t = 1}^{T} \bigg (\pi(x_t \mid s_t)&\\
    \1T(s_t, x_t, s_{t+1})\1R(s_t, x_t, y_t)&\bigg)
    \end{split}
\end{equation}
where the transition distribution $\1T$ and the reward distribution $\1R$ are given by, for $t = 1, \dots, T$,
\begin{align}
	\1T(s_t, x_t, s_{t+1}) = \int_{\1U} \I_{s_{t+1} = f_S(s_t, x_t, u_t)} P(u_t) \\
	\1R(s_t, x_t, y_t) = \int_{\1U} \I_{y_t = f_Y(s_t, x_t, u_t)} P(u_t)
\end{align}
For convenience, we write the reward function $\1R(s, x)$ as the expected value $\sum_{y} y \1R(s, x, y)$. 


\begin{figure}[t]
\centering%(d)
		\begin{tikzpicture}
			\def\outerr{3.2}
			\def\innerr{3}
			\node[vertex] (S1) at (-1, -1) {S\textsubscript{1}};
			\node[vertex] (X1) at (0, 0) {X\textsubscript{1}};
			\node[vertex] (Y1) at (0, -2) {Y\textsubscript{1}};
			\node[vertex] (S2) at (1, -1) {S\textsubscript{2}};
			\node[vertex] (X2) at (2, 0) {X\textsubscript{2}};
			\node[vertex] (Y2) at (2, -2) {Y\textsubscript{2}};
			\node[vertex] (S3) at (3, -1) {S\textsubscript{3}};
			\node[vertex] (X3) at (4, 0) {X\textsubscript{3}};
			\node[vertex] (Y3) at (4, -2) {Y\textsubscript{3}};

			\draw[dir] (S1) to (S2);
			\draw[dir] (S1) to (X1);
			\draw[dir] (S1) to (Y1);
			\draw[dir] (X1) to (Y1);
			\draw[dir] (X1) to (S2);

			\draw[bidir, draw=betterblue] (X1) to [bend left = 30] (S2);
			\draw[bidir, draw=betterblue] (X1) to [bend left = 30] (Y1);
			\draw[bidir] (Y1) to [bend right = 30] (S2);

			\draw[dir] (S2) to (S3);
			\draw[dir] (S2) to (X2);
			\draw[dir] (S2) to (Y2);
			\draw[dir] (X2) to (Y2);
			\draw[dir] (X2) to (S3);

			\draw[bidir, draw=betterblue] (X2) to [bend left = 30] (S3);
			\draw[bidir, draw=betterblue] (X2) to  [bend left = 30] (Y2);
			\draw[bidir] (Y2) to [bend right = 30] (S3);

			\draw[dir] (S3) to (X3);
			\draw[dir] (S3) to (Y3);
			\draw[dir] (X3) to (Y3);
			\draw[dir] (S3) to node {} (4.7, -1);

			\draw[bidir, draw=betterblue] (X3) to [bend left = 30] (Y3);

			\begin{pgfonlayer}{back}
				%            \node[circle,fill=betterred!25,draw=betterred!65,dashed,minimum size=2*\outerr mm] at (X1) {};
				\node[circle,fill=betterblue!65,draw=none,minimum size=2*\innerr mm] at (X1) {};
				%            \node[circle,fill=betterred!25,draw=betterred!65,dashed,minimum size=2*\outerr mm] at (X2) {};
				\node[circle,fill=betterblue!65,draw=none,minimum size=2*\innerr mm] at (X2) {};
				\node[circle,fill=betterblue!65,draw=none,minimum size=2*\innerr mm] at (X3) {};
				%\node[circle,fill=betterblue!25,draw=none,minimum size=2*\outerr mm] at (Y) {};
				\node[circle,fill=betterred!65,draw=none,minimum size=2*\innerr mm] at (Y1) {};
				\node[circle,fill=betterred!65,draw=none,minimum size=2*\innerr mm] at (Y2) {};
				\node[circle,fill=betterred!65,draw=none,minimum size=2*\innerr mm] at (Y3) {};
			\end{pgfonlayer}
		\end{tikzpicture}
	\caption{Causal diagram representing the data-generating mechanisms in a Confounded Markov Decision Process (CMDP)}
    \label{fig:_2_1_mdp}
\end{figure}

Fix a discount factor $\gamma \in [0, 1)$. A common objective for an agent is to optimize its cumulative return $R_t = \sum_{i = 0}^{\infty} \gamma^i Y_{t + i}$. In the analysis, we often evaluate the state value function $V_{\pi}(s)$, which is the expected return given the agent's starting state $S_t = s$. That is, $V_{\pi}(s) = \invE{R_t \mid S_t = s}{\pi}$. A similar state-action value function $Q_{\pi}(s, x)$ is defined as the expected return starting from state $s$, taking action $x$, and thereafter following policy $\pi$, i.e., $Q_{\pi}(s, x) = \invE{R_t \mid S_t = s}{X_t \gets x, \pi}$. One could recursively evaluate the value function of any state $s$ using the \emph{Bellman Equation} \citep{bellman1966dynamic} given by,
\begin{equation}
    \begin{split}
        V_{\pi}(s) = \sum_{x} \pi(x \mid s) \Big (\1R(s, x)& \\
        + \gamma \sum_{s'} \1T(s, x, s') V_{\pi}(s')\Big)&
    \end{split}\label{eq:_2_bellman_v}
\end{equation}
Similarly, an analogous equation for the state-action value function is given by
\begin{align}
	Q_{\pi}(s, x) & = \1R(s, x) + \gamma \sum_{s'} \1T(s, x, s') V_{\pi}(s')\label{eq:_2_bellman_q}
\end{align}
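As a brief aside that is not stated explicitly above but follows by comparing \Cref{eq:_2_bellman_v} with \Cref{eq:_2_bellman_q}, the state value is the policy-weighted average of the state-action values:
\begin{equation*}
    V_{\pi}(s) = \sum_{x} \pi(x \mid s) Q_{\pi}(s, x).
\end{equation*}
In particular, for a deterministic policy that selects action $x$ in state $s$, this reduces to $V_{\pi}(s) = Q_{\pi}(s, x)$.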
In off-policy evaluation, the agent (i.e., learner) attempts to estimate the effects of a candidate policy $\pi(x \mid s)$ from the observational data generated by the behavior policy $f_X$ (demonstrator). Standard off-policy methods focus on the identifiable setting where the transition distribution $\1T$ and reward function $\1R$ are the same under the interventional distribution $P_{\pi}$ and the observational distribution $P$. Formally,
\begin{definition}[Causal Consistency]\label{def:_2_consist}
    For a CMDP $\1M$, Causal Consistency holds if, for every time step $t = 1, 2, \dots$,
	\begin{equation}
	    \begin{split}
		 &\1T(s_t, x_t, s_{t+1}) = P\Parens{s_{t+1} \mid s_t, x_t},\\
         &\1R(s_t, x_t, y_t) =P \Parens{y_t \mid s_t, x_t}
	\end{split}\label{eq:_2_ope}
	\end{equation}   
\end{definition}
When Causal Consistency holds, the learner can recover the parametrization of the transition distribution $\1T$ and reward function $\1R$ from the observational data, following the identification formula in \Cref{eq:_2_ope}. Several off-policy algorithms have been proposed to estimate the effect of candidate policies from finite observations under causal consistency \citep{watkins1989learning,watkins1992q,swaminathan2015counterfactual,jiang2015doubly,precup2000eligibility,munos2016safe}. There exist graphical criteria in the literature \citep{pearl:rob95,shpitser:etal10,perkovic:15} to determine, from causal knowledge of the environment, whether causal consistency (\Cref{def:_2_consist}) holds, including the celebrated \emph{backdoor} criterion \citep[Def.~3.3.1]{pearl:2k}.
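
To make the identification step concrete, the following sketch substitutes \Cref{eq:_2_ope} into \Cref{eq:_2_bellman_v}. It assumes that Causal Consistency holds and that $P(x \mid s) > 0$ for every pair $(s, x)$ with $\pi(x \mid s) > 0$, so that the conditional quantities are well defined:
\begin{align*}
    V_{\pi}(s) = \sum_{x} \pi(x \mid s) \Big (& \3E\Brackets{Y_t \mid S_t = s, X_t = x} \\
    &+ \gamma \sum_{s'} P\Parens{S_{t+1} = s' \mid S_t = s, X_t = x} V_{\pi}(s') \Big).
\end{align*}
Every term on the right-hand side other than $V_{\pi}$ itself is a functional of the observational distribution, so under these assumptions the value function could in principle be estimated from offline data alone.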

However, in many practical applications, causal consistency is fragile and may fail due to violations in the generative process. These include: (1) there exists an unobserved confounder affecting the action $X_t$ and subsequent outcomes $Y_t$, $S_{t+1}$ simultaneously (blue, dashed arrows in \Cref{fig:_2_1_mdp}); (2) there is no overlap in the support between the target and behavior policies, i.e., the propensity score $P(x_t \mid s_t) = 0$ for some state-action pair $s_t, x_t$. When either violation occurs, standard off-policy methods may fail to recover the expected return of the target policy, leading to estimation bias. The following example illustrates such challenges.

\begin{figure}[t]
\centering
\hfill
        \begin{subfigure}{0.49\linewidth}\centering%(d)
		\includegraphics[width=\linewidth]{figures/lava_maze_1.png}
		\caption{}
		\label{fig:_2_1_a}
	\end{subfigure}\hfill
	\begin{subfigure}{0.49\linewidth}\centering
		\includegraphics[width=\linewidth]{figures/lava_policy_1.png}
		\caption{}
		\label{fig:_2_1_b}
	\end{subfigure}\hfill\null
	\caption{A windy gridworld environment where the red arrow represents the agent and green square is the goal state; the agent can take five actions - \texttt{up}, \texttt{down}, \texttt{right}, \texttt{left}, and \texttt{stay-put}; the wind can blow in five directions - \texttt{north}, \texttt{south}, \texttt{west}, \texttt{east}, and \texttt{no-wind}. The agent attempts to reach the goal without stepping into lava.}
    \label{fig:_2_1_windy}
\end{figure}

\begin{example}\label{exp:_2_1}
	Consider the Windy Gridworld described in \Cref{fig:_2_1_windy}, which we adapted from one of DeepMind's AI Safety Gridworlds \citep{leike2017ai}. The red triangle represents the agent, and the green square represents the goal state. The agent can take five actions $X_t$: \texttt{up}, \texttt{down}, \texttt{right}, \texttt{left}, and \texttt{stay-put}. The agent receives a constant reward $Y_t \gets 1$ if it reaches the goal state. On the other hand, the task fails, and the agent receives no reward ($Y_t \gets 0$), if it steps into the lava (orange tiles) on its way.
    
    Additionally, the agent's movement is affected by the wind direction $U_t$, which takes one of five values at each time step: \texttt{east}, \texttt{south}, \texttt{west}, \texttt{north}, and \texttt{no-wind}. The distribution of the wind direction depends on the agent's location. As an example, \Cref{fig:_2_1_a} shows samples of wind directions for every position at a single time step. In general, the wind tends to push the agent toward the lava; the closer the agent gets to the lava, the stronger the wind becomes. If the agent decides to move (i.e., $X_t \gets \texttt{up}, \texttt{down}, \texttt{right}, \texttt{left}$), its next state is shifted by both its action and the wind direction through the mechanism $S_{t+1} \gets S_t + X_t + U_t$. Otherwise, the agent stays put ($X_t \gets \texttt{stay-put}$) at its current position, regardless of the wind direction, i.e., $S_{t+1} \gets S_t$.
    
    \Crefrange{fig:_2_2_a}{fig:_2_2_b} show the value function estimates obtained by standard off-policy methods, including temporal difference with importance sampling \citep{precup2000eligibility} and tree backup \citep{sutton1998reinforcement}. For comparison, we also include in \Cref{fig:_2_2_c} the actual value function computed from the ground-truth model using value iteration. The simulation reveals that standard off-policy evaluation deviates from the ground-truth return. In this case, the demonstrator only moves when there is no wind, which makes the shorter path appear less risky than it actually is. The wind direction $U_t$ is thus an unobserved confounder affecting both the action $X_t$ and the next state $S_{t+1}$ in the offline data, violating causal consistency. See the complete technical report \citep[Appendix C]{zhang2024eligibility} for additional discussion of the Windy Gridworld environment.
\end{example}
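
The source of the bias in \Cref{exp:_2_1} can be sketched in two lines. Assume, as in the factorization of the observational distribution, that $U_t$ is drawn independently of the current state $S_t$, and that $P(x_t \mid s_t) > 0$. Conditioning on the observed action then reweights the exogenous noise:
\begin{align*}
    P\Parens{s_{t+1} \mid s_t, x_t} &= \int_{\1U} \I_{s_{t+1} = f_S(s_t, x_t, u_t)} P(u_t \mid s_t, x_t), \\
    P(u_t \mid s_t, x_t) &= \frac{\I_{x_t = f_X(s_t, u_t)} P(u_t)}{P(x_t \mid s_t)}.
\end{align*}
The transition distribution $\1T$ instead integrates the same indicator against the unconditional $P(u_t)$. The two quantities therefore agree in general only when the reweighting is immaterial, for instance when $f_X$ does not depend on $u_t$. In the example, a moving demonstrator is observed only under \texttt{no-wind}, so the conditional noise distribution concentrates on calm conditions, whereas a learner that moves by intervention faces the full wind distribution.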

\begin{figure*}[t]
	\begin{subfigure}{0.24\linewidth}\centering%(d)
		\includegraphics[width=\linewidth]{figures/lava_1_is.png}
		\caption{Off-Policy TD}
		\label{fig:_2_2_a}
	\end{subfigure}\hfill
	\begin{subfigure}{0.24\linewidth}\centering
		\includegraphics[width=\linewidth]{figures/lava_1_tree.png}
		\caption{Tree Backup}
		\label{fig:_2_2_b}
	\end{subfigure}\hfill
	\begin{subfigure}{0.24\linewidth}\centering
		\includegraphics[width=\linewidth]{figures/lava_1_opt_v.png}
		\caption{$V^{*}(s)$}
		\label{fig:_2_2_c}
	\end{subfigure}\hfill
	\begin{subfigure}{0.24\linewidth}\centering
		\includegraphics[width=\linewidth]{figures/lava_1_opt_q.png}
		\caption{$Q^*(s, \texttt{right})$}
		\label{fig:_2_2_d}
	\end{subfigure}\hfill\null

	\begin{subfigure}{0.24\linewidth}\centering
		\includegraphics[width=\linewidth]{figures/lava_1_lower_v.png}
		\caption{$\underline{V^*}(s)$}
		\label{fig:_2_2_e}
	\end{subfigure}\hfill
	\begin{subfigure}{0.24\linewidth}\centering%(d)
		\includegraphics[width=\linewidth]{figures/lava_1_upper_v.png}
		\caption{$\overline{V^*}(s)$}
		\label{fig:_2_2_f}
	\end{subfigure}\hfill
	\begin{subfigure}{0.24\linewidth}\centering
		\setlength{\abovecaptionskip}{5pt}
		\includegraphics[width=\linewidth]{figures/lava_1_lower_q.png}
		\caption{$\underline{Q^*}(s, \texttt{right})$}
		\label{fig:_2_2_g}
	\end{subfigure}\hfill
	\begin{subfigure}{0.24\linewidth}\centering
		\includegraphics[width=\linewidth]{figures/lava_1_upper_q.png}
		\caption{$\overline{Q^*}(s, \texttt{right})$}
		\label{fig:_2_2_h}
	\end{subfigure}\hfill\null
	\caption{(\subref{fig:_2_2_a} - \subref{fig:_2_2_b}) Value function estimation obtained by standard off-policy methods; (\subref{fig:_2_2_c} - \subref{fig:_2_2_d}) The ground-truth value function computed from the underlying model; (\subref{fig:_2_2_e} - \subref{fig:_2_2_h}) Lower and upper bounds on the value functions obtained by causally enhanced off-policy algorithms using eligibility traces (\texttt{C-TD($\lambda$)} and \texttt{C-TB($\lambda$)})}
	\label{fig:_2_2}
\end{figure*}

\subsection{Partial Causal Identification in Confounded MDPs}\label{sec:_2_1}
This section introduces partial identification methods for off-policy evaluation that are robust to unobserved confounding and lack of overlap. For every time step $t = 1, 2, \dots$, let the reward $Y_t$ be bounded in a real interval $[a, b]$. By applying a bounding strategy similar to \citep{manski:90,zhang2019near,joshi2024towards}, we derive the following bounds over the transition probability distribution $\1T$ for every realization $(s, x, s') \in \1S \times \1X \times \1S$,
\begin{equation}
    \begin{split}
        & \1T\Parens{s, x, s'} \geq \widetilde{\1T}\Parens{s, x, s'}P(x \mid s) \\
        &\1T\Parens{s, x, s'} \leq \widetilde{\1T}\Parens{s, x, s'}P(x \mid s) + P\Parens{\neg x \mid s} 
    \end{split}\label{eq:_2_nb_t} 
\end{equation}
where $P(x \mid s) = P\Parens{X_t = x \mid S_t = s}$ and $P(\neg x \mid s) = 1 - P(x \mid s)$; and $\widetilde{\1T}$ is the nominal transition distribution computed from the observational distribution as $\widetilde{\1T}\Parens{s, x, s'} = P\Parens{S_{t+1} = s' \mid S_t = s, X_t = x}$. Similarly, one could also derive the following bound over the reward function $\1R$ for every state-action pair $(s, x) \in \1S \times \1X$,
\begin{equation}
    \begin{split}
        & \1R\Parens{s, x} \geq \widetilde{\1R}\Parens{s, x}P(x \mid s) + a P(\neg x \mid s),\\
        &\1R\Parens{s, x}  \leq \widetilde{\1R}\Parens{s, x}P(x \mid s) + b P(\neg x \mid s) 
    \end{split}\label{eq:_2_nb_r}
\end{equation}
where $\widetilde{\1R}$ is the nominal reward function given by $\widetilde{\1R}\Parens{s, x} = \3E\Brackets{Y_t \mid S_t = s, X_t = x}$.
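
For intuition, the transition bound in \Cref{eq:_2_nb_t} can be recovered by a short argument. This is a sketch under the assumption that $U_t$ is independent of $S_t$, not a substitute for the proofs in the cited works. Split the integral defining $\1T$ according to whether the behavior policy would have chosen $x$:
\begin{align*}
    \1T\Parens{s, x, s'} &= \int_{\1U} \I_{s' = f_S(s, x, u)} \I_{x = f_X(s, u)} P(u) \\
    &\quad + \int_{\1U} \I_{s' = f_S(s, x, u)} \I_{x \neq f_X(s, u)} P(u).
\end{align*}
The first term is the observed joint probability $P\Parens{S_{t+1} = s', X_t = x \mid S_t = s} = \widetilde{\1T}\Parens{s, x, s'} P(x \mid s)$. The second term concerns transitions under an action the demonstrator did not take, so it is not observed; it is nonnegative and at most $\int_{\1U} \I_{x \neq f_X(s, u)} P(u) = P(\neg x \mid s)$. Setting it to these two extremes gives the lower and upper bounds. The reward bound in \Cref{eq:_2_nb_r} follows from the same decomposition, with the unobserved term ranging over $[a P(\neg x \mid s), b P(\neg x \mid s)]$ because $Y_t \in [a, b]$.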

To bound the value function $V_{\pi}(s)$ at state $s$ induced by a candidate policy $\pi$, one could minimize/maximize the Bellman equation in \Cref{eq:_2_bellman_v} as the objective function, subject to the constraints in \Cref{eq:_2_nb_t,eq:_2_nb_r}. Interestingly, this optimization problem is equivalent to a linear program; solving it leads to the following \emph{extended Bellman equation}.
\begin{restatable}[Causal Bellman Equation]{theorem}{thmbellmanv}\label{thm:_2_1}
	For a CMDP $\1M$ with reward domain $\1Y = [a, b]$, for any policy $\pi(x \mid s)$, its state value function $V_{\pi}(s) \in \Brackets{\underline{V_{\pi}}(s), \overline{V_{\pi}}(s)}$ for every state $s \in \1S$, where the lower bound $\underline{V_{\pi}}$ is a solution given by the following dynamic program,
	\begin{align}
	\!\!\!\! \underline{V_{\pi}}(s) = \sum_{x} P(x\mid s) \bigg (\pi(\neg x \mid s) \Big( a + \gamma \min_{s'} \underline{V_{\pi}}(s') \Big )& \label{eq:_2_lower_v_1} \\
    +\pi(x \mid s) \Big (\widetilde{\1R}\Parens{s, x} + \gamma \sum_{s'} \widetilde{\1T}\Parens{s, x, s'} \underline{V_{\pi}}(s') \Big)\bigg ) &\label{eq:_2_lower_v_2}
    \end{align}
    Similarly, the upper bound $\overline{V_{\pi}}$ is a solution given by
    \begin{align}
	\!\!\!\!\! \overline{V_{\pi}}(s) = \sum_{x} P(x\mid s) \bigg (\pi(\neg x \mid s) \Big( b + \gamma \max_{s'} \overline{V_{\pi}}(s') \Big )& \label{eq:_2_upper_v_1} \\
    +\pi(x \mid s) \Big (\widetilde{\1R}\Parens{s, x} + \gamma \sum_{s'} \widetilde{\1T}\Parens{s, x, s'} \overline{V_{\pi}}(s') \Big)\bigg ) &\label{eq:_2_upper_v_2}
    \end{align}
\end{restatable}
\Cref{thm:_2_1} can be seen as an extension of the Bellman equation to confounded observational data without overlap; here we read $\pi(\neg x \mid s)$ as $1 - \pi(x \mid s)$, by analogy with $P(\neg x \mid s)$. For instance, in the lower bound $\underline{V_{\pi}}(s)$, \Cref{eq:_2_lower_v_1} can be thought of as a regularizing term measuring the uncertainty due to unobserved confounding; \Cref{eq:_2_lower_v_2} follows the standard iterative step of the Bellman equation in \Cref{eq:_2_bellman_v}, measuring the expected return when the target policy's action coincides with the observed action selected by the behavior policy. Both terms are weighted by the nominal propensity score $P\Parens{x\mid s}$. The same interpretation applies to the upper bound $\overline{V_{\pi}}(s)$. An analogous extended Bellman equation bounding the state-action value function from the observational distribution can also be derived as follows.
\begin{restatable}[Causal Bellman Equation]{theorem}{thmbellmanq}\label{thm:_2_2}
	For a CMDP $\1M$ with reward domain $\1Y = [a, b]$, for any policy $\pi(x \mid s)$, its state-action value function $Q_{\pi}(s, x) \in \Brackets{\underline{Q_{\pi}}(s, x), \overline{Q_{\pi}}(s, x)}$ for any state-action pair $(s, x) \in \1S \times \1X$, where the lower bound $\underline{Q_{\pi}}$ is a solution given by the following dynamic program,
	\begin{align}
		&\underline{Q_{\pi}}(s, x) = P(\neg x \mid s) \Big (a  + \gamma  \min_{s'} \underline{V_{\pi}}(s') \Big ) \label{eq:_2_lower_q_1} \\
		&+ P(x\mid s) \Big (\widetilde{\1R}\Parens{s, x} + \gamma \sum_{s'} \widetilde{\1T}\Parens{s, x, s'} \underline{V_{\pi}}(s') \Big ) \label{eq:_2_lower_q_2}
	\end{align}
    Similarly, the upper bound $\overline{Q_{\pi}}$ is a solution given by
    \begin{align}
		&\overline{Q_{\pi}}(s, x) =  P(\neg x \mid s) \Big (b  + \gamma  \max_{s'} \overline{V_{\pi}}(s') \Big )  \label{eq:_2_upper_q_1} \\
		&+ P(x\mid s) \Big (\widetilde{\1R}\Parens{s, x} + \gamma \sum_{s'} \widetilde{\1T}\Parens{s, x, s'} \overline{V_{\pi}}(s') \Big ) \label{eq:_2_upper_q_2}
	\end{align}
\end{restatable}
Among the bounds in \Cref{thm:_2_2}, \Cref{eq:_2_lower_q_1} is a regularizing term accounting for the uncertainty that arises when the intervention $\doo(x)$ is not observed in the offline data; \Cref{eq:_2_lower_q_2} is the standard iterative step of the Bellman equation in \Cref{eq:_2_bellman_q}, weighted by the propensity score $P\Parens{x \mid s}$. Since \Cref{thm:_2_1,thm:_2_2} are closed-form solutions of optimization programs and the observational constraints in \Cref{eq:_2_nb_t,eq:_2_nb_r} are tight, the extended Bellman equation bounds are optimal given the offline data and the Markov property. That is, these bounds cannot be improved without additional assumptions.
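
As a sanity check on \Cref{thm:_2_1,thm:_2_2}, consider a hypothetical single-state CMDP; all numbers below are illustrative assumptions and are not taken from the experiments. Let $\1S = \{s\}$, so that $\widetilde{\1T}\Parens{s, x, s} = 1$ and the $\min$ and $\max$ over $s'$ are trivial. Let $[a, b] = [0, 1]$ and $\gamma = 0.9$, let the target policy select a fixed action $x_1$ with probability one, and suppose the offline data give $P(x_1 \mid s) = 0.6$ and $\widetilde{\1R}\Parens{s, x_1} = 0.5$. The dynamic programs of \Cref{thm:_2_1} then reduce to two scalar fixed-point equations:
\begin{align*}
    \underline{V_{\pi}}(s) &= 0.6 \Parens{0.5 + 0.9\, \underline{V_{\pi}}(s)} + 0.4 \Parens{0 + 0.9\, \underline{V_{\pi}}(s)} \\
    &= 0.3 + 0.9\, \underline{V_{\pi}}(s) \;\Longrightarrow\; \underline{V_{\pi}}(s) = 3, \\
    \overline{V_{\pi}}(s) &= 0.6 \Parens{0.5 + 0.9\, \overline{V_{\pi}}(s)} + 0.4 \Parens{1 + 0.9\, \overline{V_{\pi}}(s)} \\
    &= 0.7 + 0.9\, \overline{V_{\pi}}(s) \;\Longrightarrow\; \overline{V_{\pi}}(s) = 7.
\end{align*}
This agrees with the reward bound in \Cref{eq:_2_nb_r}, which gives $\1R\Parens{s, x_1} \in [0.3, 0.7]$ and hence $V_{\pi}(s) = \1R\Parens{s, x_1} / (1 - \gamma) \in [3, 7]$. Plugging the same numbers into \Cref{thm:_2_2} yields
\begin{align*}
    \underline{Q_{\pi}}(s, x_1) &= 0.4 \Parens{0 + 2.7} + 0.6 \Parens{0.5 + 2.7} = 3, \\
    \overline{Q_{\pi}}(s, x_1) &= 0.4 \Parens{1 + 6.3} + 0.6 \Parens{0.5 + 6.3} = 7,
\end{align*}
which coincide with the bounds on $V_{\pi}(s)$, as expected for a policy that always selects $x_1$. In this instance, the width of the interval is $P(\neg x_1 \mid s)(b - a)/(1 - \gamma) = 4$, so it shrinks to zero as the behavior policy's propensity for $x_1$ approaches one.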