\section{Minimax Control}\label{sec:control} %

Following Lemma~\ref{lem:sharpness}, an MPC algorithm can be ideally conservative by choosing the action trajectory with the highest lower bound on expected reward that the hidden-confounding constraints admit. A controller is said to be minimax-optimal if its worst-case expected discounted reward, accumulated over the actions it takes at every time step, is the maximum over all controllers.


Recall that the outcome optimized by MPC is, in theory, $Y=\sum_{t=0}^\infty \gamma^t R_t$, which it optimizes by selecting infinite-horizon action trajectories $\Pi=[A_0\ A_1\ A_2\ \cdots]$ given the observable state representation $X= [O_0\ \ O_{-1} \ A_{-1}\ \ O_{-2} \ A_{-2}\ \cdots]$. We take the perspective of stochastic control of an uncertain system~\citep{bertsekas_rl}. The uncertainty is the hidden confounding induced when the controller uses $X$ instead of $S_0$.

Let $\mathcal{U}(A_0,X)$ denote the set of values that an uncertainty variable $U$ can take, such that the actual instantaneous reward $R_0$ and state transition $X'\triangleq [O_1\ \ O_{0}\ A_{0}\ \cdots]$ from any action $A_0=a$ at the current state representation $X=x$ are indexed by conditioning on $U=u$ for some $u\in\mathcal{U}(a,x)$. Further, let that mapping be bijective, so that every such $u$ induces an admissible reward and state transition. In that case, the value function of the minimax controller must satisfy the Bellman equation
\begin{multline}\label{eq:bellman-uncertain-optimal}
    V^*(x) = \max_{a\in\mathcal{A}} \inf_{u\in\mathcal{U}(a,x)} \\
    \E\big[R_0 + \gamma V^*(X') \bigm| A_0=a, X=x, U=u\big].
\end{multline}
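As a brief aside on well-posedness, and under assumptions that are ours rather than part of the setup above (bounded rewards, $\gamma\in[0,1)$, and bounded measurable value functions), the right-hand side of Equation~\ref{eq:bellman-uncertain-optimal} can be read as an operator $\mathcal{B}$ acting on $V$. Since both $\max$ and $\inf$ are non-expansive, a short calculation suggests that $\mathcal{B}$ is a $\gamma$-contraction in the supremum norm:
\begin{align*}
    \big\lvert (\mathcal{B}V)(x) - (\mathcal{B}W)(x) \big\rvert
    &\leq \max_{a\in\mathcal{A}} \sup_{u\in\mathcal{U}(a,x)}
    \gamma\, \Big\lvert \E\big[ V(X') - W(X') \bigm| A_0=a, X=x, U=u\big] \Big\rvert \\
    &\leq \gamma \norm{V-W}_\infty.
\end{align*}
Under these assumptions, $V^*$ would be the unique bounded fixed point of $\mathcal{B}$.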
The sensitivity model constrains how far the rewards of a controller's actions can deviate from those predicted by an offline world model. We assume that this constraint is reflected in $U$.
The value function of a stationary, deterministic policy $f:\mathcal{X}\to \mathcal{A}$ can be written as
\begin{multline}\label{eq:bellman-uncertain-policy}
    V_f(x) = \inf_{u\in\mathcal{U}(f(x),x)} \\
    \E\big[R_0 + \gamma V_f(X') \bigm| A_0=f(x), X=x, U=u\big].
\end{multline}
Our theoretical MPC approach is such a policy $f$. It projects jointly sampled trajectories and then takes the first action of the best action trajectory from the closed set $\mathcal{T}$.
By Lemma~\ref{lem:sharpness}, our $f_\text{MPC}(x)$ solves
\begin{equation*} %
     \max_{a_0:\pi=[a_0\ a_1\ \cdots]\in\mathcal{T}} \E\big[Y \tilde g_\pi^{(-)}\mid \Pi=\pi, X=x\big].
\end{equation*}
The quantity being maximized is a lower bound on $\E[Y(\pi)\mid X=x]$, where $Y(\pi)$ is distributed as the discounted reward obtained by following the action trajectory $\pi$. Both the state transitions after $a_0$ \emph{and} the reward uncertainty are already encapsulated in $Y(\pi)$, by Definition~\ref{def:potential-outcome}.


Plugging $f_\text{MPC}$ into Equation~\ref{eq:bellman-uncertain-policy} shows that maximizing the sharp bound on the expected discounted reward of the projected trajectories also maximizes $V_{f_\text{MPC}}$; details are provided in \S\ref{app:proof-control}.
We formalize this in the following lemma:
\begin{lemma}[Minimax Control]\label{lem:control} %
  The proposed partially identified MPC described by $f_\text{MPC}$ reaches the minimax value,
  \begin{equation*}
    V_{f_\mathrm{MPC}}(X) = V^*(X) \quad \text{almost everywhere.}
  \end{equation*}
\end{lemma}

Lemma~\ref{lem:control} shows that the controller achieves the optimal reward in the minimax sense.



\begin{algorithm}
  \caption{Partially Identified MPPI (single step)}\label{alg:mppi}
  \KwIn{dynamics models $\hat P_{\Pi|X}$ and $\hat P_{Y|\Pi,X}$, decision-making context $x\in\mathcal{X}$,  sensitivity~parameter $\Gamma\geq 1$ }
  \KwOut{best action $\hat a_0\in\mathcal{A}$}
  Sample i.i.d.\ action trajectories $\pi^{(1)}, \pi^{(2)}, \pi^{(3)},\dots$ according to $\hat P_{\Pi| X}(\pi\mid x)$\;
  \ForEach{search iteration}{
    \ForEach{action trajectory $\pi^{(i)}$}{
      Estimate bounds for density ratio $\tilde g_{\pi^{(i)}}$ as $\hat\E\big[\Gamma^{\pm\norm{\Pi-\pi^{(i)}}}\bigm| X=x\big]$ using the $\pi$-sample\;
      Sample i.i.d.\ reward trajectories $y^{(i,1)}, y^{(i,2)}, \dots$ according to $\hat P_{Y|\Pi,X}(y\mid \pi^{(i)},x)$\;
      Estimate reward lower bound ${\hat y^{(i)}\triangleq \hat\E\big[Y\tilde g^{(-)}_{\pi^{(i)}}\mid \Pi=\pi^{(i)}, X=x\big]}$\;
    }
    Update policy estimate $\hat\pi$ using action-reward pairs $(\pi^{(1)}, \hat y^{(1)}), (\pi^{(2)},\hat y^{(2)}) \dots$ as in classic MPPI\;
    Resample i.i.d.\ action trajectories $\pi^{(1)}, \pi^{(2)}, \dots$ according to the current policy estimate $\hat \pi$\;
  }
  Select and return first action $\hat a_0$ from policy estimate $\hat\pi$\;
\end{algorithm}


\subsection{Implementation}
We present a concrete implementation of MPC with our partial identification strategy. State-of-the-art model-based RL algorithms~\citep{hansen2024tdmpc,hu2023planning} tend to use a variant of MPC called model predictive path integral (MPPI) control~\citep{williams2015model}. MPPI operates on samples of future dynamics, ranking and weighting action trajectories by their projected rewards. We assume access to generative models for the conditional distributions of action trajectories, $P_{\Pi|X}$, and of discounted rewards, $P_{Y|\Pi,X}$. In practice, the infinite horizon of the discounted reward must be approximated by a sufficiently long finite horizon, with a terminal value estimator if necessary.
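
To make the finite-horizon approximation concrete, one possible form is sketched here. The horizon $H$, the terminal value estimate $\hat V$, the state representation $X_H$ at step $H$, and the reward bound $R_{\max}$ are illustrative assumptions of ours, not quantities defined above:
\begin{equation*}
    Y \approx \sum_{t=0}^{H-1} \gamma^t R_t + \gamma^H \hat V(X_H).
\end{equation*}
If $\lvert R_t\rvert \leq R_{\max}$ for all $t$ and no terminal estimate is used ($\hat V\equiv 0$), the truncation error satisfies
\begin{equation*}
    \bigg\lvert\, Y - \sum_{t=0}^{H-1} \gamma^t R_t \,\bigg\rvert
    = \bigg\lvert \sum_{t=H}^{\infty} \gamma^t R_t \bigg\rvert
    \leq \frac{\gamma^H R_{\max}}{1-\gamma},
\end{equation*}
so a target error $\varepsilon < R_{\max}/(1-\gamma)$ is met once $H \geq \log\big(\varepsilon(1-\gamma)/R_{\max}\big)/\log\gamma$. For example, with $\gamma=0.99$, $R_{\max}=1$, and $\varepsilon=0.1$, this requires $H\geq 688$, which indicates why a terminal value estimator can be preferable to a very long rollout.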


Algorithm~\ref{alg:mppi} augments MPPI by lower-bounding the expected reward of each sampled action trajectory through Lemma~\ref{lem:sharpness}, whereas classic MPPI uses the sampled reward directly. We adopt the same heuristics for ranking and weighting trajectories as prior work. During each search iteration, MPPI updates its policy estimate $\hat\pi$, which is typically parameterized as a multivariate Gaussian with diagonal covariance across time steps. In line~7 of Algorithm~\ref{alg:mppi}, the means and variances are estimated from the top-$k$ trajectories, with weights computed through a softmax over the reward lower-bound estimates. In line~8, the action-trajectory sample is replaced with a sample from this Gaussian policy estimate, so that subsequent iterations further refine the distribution.
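
One common way to write the update in line~7 of Algorithm~\ref{alg:mppi} is sketched below. The temperature $\lambda>0$, the elite index set $\mathcal{K}$, and the per-step symbols $a_t^{(i)}$, $\mu_t$, and $\sigma_t^2$ are notation introduced here for illustration only, and the exact heuristics may differ across implementations:
\begin{align*}
    w_i &= \frac{\exp\big(\hat y^{(i)}/\lambda\big)}{\sum_{j\in\mathcal{K}} \exp\big(\hat y^{(j)}/\lambda\big)}, \qquad i\in\mathcal{K}, \\
    \mu_t &= \sum_{i\in\mathcal{K}} w_i\, a_t^{(i)}, \\
    \sigma_t^2 &= \sum_{i\in\mathcal{K}} w_i\, \big(a_t^{(i)} - \mu_t\big)^2,
\end{align*}
where $\mathcal{K}$ indexes the $k$ trajectories with the largest lower-bound estimates $\hat y^{(i)}$, $a_t^{(i)}$ is the action of trajectory $\pi^{(i)}$ at step $t$, and the square is taken elementwise for vector-valued actions. Under this sketch, line~8 draws $a_t\sim\mathcal{N}\big(\mu_t, \operatorname{diag}(\sigma_t^2)\big)$ independently across time steps.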
