In this section, we describe the robust MDP setting, beginning with the necessary notation.
\paragraph{Robust Markov decision process}
We consider an episodic finite-horizon tabular robust MDP, denoted by the tuple $\gM = \langle \gS, \gA, H, $ $\{\gP_h\}_{h=1}^H, \{r_h\}_{h=1}^H \rangle$. Here $\gS$ is the state space, $\gA$ is the action space, $\{r_h\}_{h=1}^H$ is the time-dependent reward function, and $H$ is the length of each episode. Instead of a fixed transition kernel, the transitions of the robust MDP are governed by kernels from a time-dependent uncertainty set $\{\gP_h\}_{h=1}^H$, $\ie$, the time-dependent transition satisfies $P_h \in \gP_h \subseteq \Delta_{\gS}$ at time $h$. 

The uncertainty set $\gP$ is constructed around a nominal transition kernel $\{P_h^o\}$, and all transition dynamics within the set are close to the nominal kernel under a distance metric of one's choice. Unlike in an episodic finite-horizon non-robust MDP, the transition kernel $P$ can be chosen (even adversarially) from a specified time-dependent uncertainty set $\gP$. We consider the case where the rewards are stochastic. That is, at state-action pair $(s,a)$ and time $h$, the immediate reward is $R_h(s,a) \in [0,1]$, which is drawn i.i.d.\ from a distribution with expectation $r_h(s,a)$.
With the described setup of robust MDPs, we now define the policy and its associated value.




\paragraph{Policy and robust value function}
A time-dependent policy $\pi$ is defined as $\pi = \{\pi_h\}_{h=1}^H$, where each $\pi_h$ is a function from $\gS$ to the probability simplex over actions, $\Delta(\gA)$. 
If the transition kernel is fixed to be $P$, the performance of a policy $\pi$ starting from state $s$ at time $h$ can be measured by its value function, which is defined as 
\begin{align*}
    V_h^{\pi, P}(s) = \mathbb{E}_{\pi, P}\left[ \sum^H_{h^\prime = h} r_{h^\prime}(s_{h^\prime}, a_{h^\prime}) \mid s_h = s\right] \,.
\end{align*}
In a robust MDP, the robust value function instead measures the performance of $\pi$ under the worst possible choice of transition $P$ within the uncertainty set. Specifically, the value function of a policy at state $s$ and its Q-value function at the state-action pair $(s,a)$ at step $h$ are defined as 
\begin{align*}% \label{eq:robust_val}
    V^{\pi}_h (s) = \ & \min_{\{P_h\} \in \{\gP_h\}} V^{\pi, \{P_h\}}_h (s) \,, \nonumber \\ 
    Q^{\pi}_h (s,a) = \ & \min_{\{P_h\} \in \{\gP_h\}} \mathbb{E}_{\pi, \{P_h\}}\Biggl[\sum^H_{h^\prime =h} r_{h^\prime} (s_{h^\prime} , a_{h^\prime} )  \mid (s_h,a_h) = (s, a)\Biggr] \,.
\end{align*}
The optimal value function is defined to be the best possible robust value attained by any policy, $V^{\ast}_h (s) = \max_{\pi} V^{\pi}_h (s) = \max_{\pi} \min_{\{P_h\} \in \{\gP_h\}} V^{\pi, \{P_h\}}_h (s) $.
The optimal policy is then defined to be the policy that attains the optimal value.

\paragraph{Robust Bellman equation}
Similar to the non-robust MDP, the robust MDP satisfies the following robust Bellman equation, which characterizes the robust value function recursively \cite{wiesemann2013robust,ho2021partial,yang2021towards,behzadian2021fast}.
\begin{align*}%\label{eq:robust_bellman}
    Q^{\pi}_h (s,a) = \ & r_h(s,a) + \sigma_{\gP_h}(V_{h+1}^\pi)(s,a)\,, \\
    V^{\pi}_h (s) = \ & \langle Q^{\pi}_h (s,\cdot), \pi_h(\cdot \mid s) \rangle \,,
\end{align*} where
\begin{align}\label{eq:sigma}
    \sigma_{\gP_h}(V_{h+1}^\pi)(s,a) = \ & \min\limits_{P_h \in \gP_h} P_h(\cdot \mid s,a) V_{h+1}^\pi \,, \nonumber\\
    P_h(\cdot \mid s,a) V = \ &\sum\limits_{s^\prime \in \gS} P_h(s^\prime \mid s,a) V(s^\prime)\,.
\end{align}
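For concreteness, the recursion can be unrolled as follows; this is only a sketch based on the definitions above, and it adopts the convention $V_{H+1}^\pi \equiv 0$, which is our notational assumption. Since the sums defining $V_h^\pi$ and $Q_h^\pi$ stop at step $H$, we have $\sigma_{\gP_H}(V_{H+1}^\pi)(s,a) = 0$ and therefore
\begin{align*}
    Q^{\pi}_H (s,a) = \ & r_H(s,a) \,, \\
    V^{\pi}_H (s) = \ & \langle r_H(s,\cdot), \pi_H(\cdot \mid s) \rangle \,.
\end{align*}
The remaining values $Q_h^\pi$ and $V_h^\pi$ then follow by applying the two equations in turn for $h = H-1, \dots, 1$.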
%\vspace{-0.1cm}
Without additional assumptions on the uncertainty set, computing the optimal policy and value of a robust MDP is in general NP-hard \cite{wiesemann2013robust}. One of the most common assumptions that makes computing the optimal value tractable is the rectangularity assumption \cite{iyengar2005robust,wiesemann2013robust,badrinath2021robust,yang2021towards,panaganti2022sample}. 
\paragraph{Rectangular uncertainty sets}
To limit the level of perturbation, we assume that the transition kernel is close to the nominal transition kernel, as measured by the $\ell_1$ distance. We consider two cases.

The $(s,a)$-rectangular assumption requires that the uncertain transition kernels within the set take values independently for each $(s,a)$. We further use the $\ell_1$ distance to characterize the $(s,a)$-rectangular set around a nominal kernel with a specified level of uncertainty.
\begin{defn}[$(s,a)$-rectangular uncertainty set \cite{iyengar2005robust,wiesemann2013robust}]\label{def:sa}
For every time step $h$ and every state-action pair $(s,a)$, the $(s,a)$-rectangular uncertainty set $\gP_h(s,a)$ is defined as 
\begin{align*}
    \gP_h(s,a) = \ &  \ \left\{\left\|P_h(\cdot \mid s,a) - P_h^o(\cdot \mid s,a) \right\|_1 \leq \rho, \right. \\
    & \ \left. P_h(\cdot \mid s,a) \in \Delta(\gS) \right\}\,, 
\end{align*}
where $P_h^o$ is the nominal transition kernel at $h$, $P_h^o(\cdot \mid s,a) \geq c > 0, \forall (s,a) \in \gS \times \gA$, $\rho$ is the level of uncertainty and $\Delta(\gS)$ denotes the probability simplex over the state space $\gS$.
\end{defn}
With the $(s,a)$-rectangular set, it is shown that there always exists an optimal policy that is deterministic \cite{wiesemann2013robust}. 
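
As a small illustration of Definition~\ref{def:sa} together with the operator in \eqref{eq:sigma}, consider the following worked instance; the numbers are hypothetical and chosen by us only for illustration. Take $\gS = \{s_1, s_2\}$, fix a pair $(s,a)$, and suppose $P_h^o(\cdot \mid s,a) = (0.5, 0.5)$, $V_{h+1}^\pi = (1, 0)$, and $\rho = 0.2$. Any $P_h(\cdot \mid s,a) = (p, 1-p) \in \Delta(\gS)$ satisfies $\left\|P_h(\cdot \mid s,a) - P_h^o(\cdot \mid s,a)\right\|_1 = 2\,|p - 0.5|$, so $\gP_h(s,a) = \{(p, 1-p) : p \in [0.4, 0.6]\}$ and
\begin{align*}
    \sigma_{\gP_h}(V_{h+1}^\pi)(s,a) = \ & \min_{p \in [0.4,\, 0.6]} \bigl( p \cdot 1 + (1-p) \cdot 0 \bigr) = 0.4 \,,
\end{align*}
compared with the nominal expectation $P_h^o(\cdot \mid s,a) V_{h+1}^\pi = 0.5$. In this instance the adversary moves probability mass $\rho/2 = 0.1$ from the higher-value state to the lower-value one.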

One way to relax the $(s,a)$-rectangular assumption is to instead let the uncertain transition kernels within the set take values independently for each state $s$ only. This characterization is more general, and its solution gives a stronger robustness guarantee. 
\begin{defn}[$s$-rectangular uncertainty set \cite{wiesemann2013robust}]\label{def:s}
For every time step $h$ and every state $s$, the $s$-rectangular uncertainty set $\gP_h(s)$ is defined as 
\begin{align*}
    \gP_h(s) = \ & \biggl\{ \sum_{a \in \gA}\left\|P_h(\cdot \mid s,a) - P_h^o(\cdot \mid s,a) \right\|_1 \leq A \rho \,, \\
    & \ P_h(\cdot \mid s,\cdot) \in \Delta(\gS)^{\gA}  \biggr\} \,,
\end{align*}
where $P_h^o$ is the nominal transition kernel at $h$, $P_h^o(\cdot \mid s,a) > 0, \forall (s,a) \in \gS \times \gA$, $A = |\gA|$ is the number of actions, $\rho$ is the level of uncertainty, and $\Delta(\gS)$ denotes the probability simplex over the state space $\gS$.
\end{defn}
Unlike the $(s,a)$-rectangular assumption, which guarantees the existence of a deterministic optimal policy, the optimal policy under the $s$-rectangular set may need to be randomized \cite{wiesemann2013robust}. We also remark that the requirement $P_h^o(\cdot \mid s,a) > 0$ is mostly for technical convenience. 
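
To see how Definition~\ref{def:s} relaxes Definition~\ref{def:sa}, consider the following sketch, in which we read $A$ as the number of actions $|\gA|$ (an assumption on our part) and the numbers are hypothetical. If $P_h(\cdot \mid s,a) \in \gP_h(s,a)$ for every $a \in \gA$, then
\begin{align*}
    \sum_{a \in \gA}\left\|P_h(\cdot \mid s,a) - P_h^o(\cdot \mid s,a) \right\|_1 \leq \sum_{a \in \gA} \rho = A \rho \,,
\end{align*}
so every kernel allowed by the $(s,a)$-rectangular sets at state $s$ also belongs to $\gP_h(s)$. The converse fails in general. With two states, two actions $a_1, a_2$, $\rho = 0.2$, and $P_h^o(\cdot \mid s,a_1) = P_h^o(\cdot \mid s,a_2) = (0.5, 0.5)$, the choice $P_h(\cdot \mid s,a_1) = (0.3, 0.7)$ and $P_h(\cdot \mid s,a_2) = (0.5, 0.5)$ has total deviation $0.4 + 0 = A\rho$ and hence lies in $\gP_h(s)$, although $\left\|P_h(\cdot \mid s,a_1) - P_h^o(\cdot \mid s,a_1)\right\|_1 = 0.4 > \rho$. The adversary may thus concentrate its budget on a single action.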

Equipped with the characterization of the uncertainty set, we now describe the learning protocols and the definition of regret under the robust MDP. 

\paragraph{Learning protocols and regret}
We consider a learning agent that repeatedly interacts with the environment in an episodic manner over $K$ episodes. 
At the start of each episode, the learning agent picks a policy $\pi_k$ and interacts with the environment while executing $\pi_k$. 
Without loss of generality, we assume the agent always starts from a fixed initial state $s$. The performance of the learning agent is measured by the cumulative regret incurred over the $K$ episodes. Under the robust MDP, the cumulative regret is defined to be the cumulative difference between the robust value of the optimal policy and the robust value of $\pi_k$, 
$%\label{eq:regret}
    \text{Regret}(K) = \sum^K_{k=1} \left( V_1^{\ast}(s_1^k) - V_1^{\pi_k} (s_1^k) \right)
$,
where $s_1^k$ is the initial state in episode $k$.
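As a sanity check on the scale of this quantity, the following short derivation uses only the definitions above. Since $R_h(s,a) \in [0,1]$, the expected rewards satisfy $r_h(s,a) \in [0,1]$, and hence $0 \leq V_1^{\pi}(s) \leq H$ for every policy $\pi$ and state $s$. Moreover, $V_1^{\ast}(s) = \max_{\pi} V_1^{\pi}(s) \geq V_1^{\pi_k}(s)$, so each summand lies in $[0, H]$ and
\begin{align*}
    0 \leq \text{Regret}(K) \leq K H \,.
\end{align*}
A regret bound is therefore informative only if it grows sublinearly in $K$.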


We highlight that the state transitions during the learning process are governed by the nominal transition kernel $\{P_h^o\}_{h=1}^H$, which the agent can access only in an online manner. We remark that if the agent is asked to interact with a potentially adversarially chosen transition from an arbitrary uncertainty set, the learning problem is NP-hard \cite{even2004experts}. 

One practical motivation for this formulation is as follows. The policy provider only sees feedback from the nominal system, yet it aims to minimize the regret for clients who refuse to share additional deployment details for reasons such as privacy. Thus the observed feedback describes the ``nominal transition'' while the unseen clients are represented by the ``uncertainty set''.

