We first describe a distribution over a set of hard $3$-state problem instances. Then, by Yao's principle, we show that any deterministic algorithm that performs well on many of these instances must fail on another one.

Consider a set of MDPs with three states $s_0, s_1, s_2$ and a set of actions $\gA$. Without loss of generality, we set $s_0$ to be the initial state. The reward function is defined as follows. In states $s_0$ and $s_2$, every action yields zero reward. In state $s_1$, every action yields a reward of $1$. The transition dynamics are defined as follows. In state $s_0$, a special action $a^\ast$ leads to $s_1$ with probability $0.5 + \epsilon$ and to $s_2$ with probability $0.5 - \epsilon$. Every other action taken in $s_0$ leads to $s_1$ or $s_2$, each with probability $1/2$. States $s_1$ and $s_2$ are self-looping, in the sense that taking any action in either of these states leads back to the same state. We remark that, by construction, these MDPs can be reduced to a set of bandit instances in which a unique arm has mean reward $0.5 + \epsilon$. The only difference is that the regret is scaled up by a factor of $H - 1$.
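
To make the reduction concrete, the following sketch computes the nominal value of each kind of first action. It assumes an undiscounted episode of $H$ steps whose first step is taken in $s_0$, which is our reading of the $H - 1$ factor; the notation $Q(s_0, a)$ is introduced here only for illustration.
\begin{align*}
    Q(s_0, a^\ast)
    = \ & (H - 1)(0.5 + \epsilon) \cdot 1 + (H - 1)(0.5 - \epsilon) \cdot 0 = (H - 1)(0.5 + \epsilon)\,,\\
    Q(s_0, a)
    = \ & 0.5\,(H - 1) \quad \text{for } a \neq a^\ast\,,\\
    Q(s_0, a^\ast) - Q(s_0, a)
    = \ & (H - 1)\,\epsilon\,.
\end{align*}
Under this assumption, each episode behaves like a single pull of a bandit arm whose payoff is multiplied by $H - 1$.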

With an $(s,a)$-rectangular uncertainty set of radius $\rho$ \dong{does this work with $s$-rectangular?}, the MDP that corresponds to the distributionally robust value is the following. From $s_0$, this robust MDP instead lets the special action $a^\ast$ transit to $s_1$ with probability $0.5 + \epsilon - 0.5 \rho$ and to $s_2$ with probability $0.5 - \epsilon + 0.5\rho$. \dong{This is because we require the robust MDP's transition kernel to be absolutely continuous with respect to the original one. Hence the self-transitions of $s_1$ and $s_2$ must remain unchanged. } By the same reasoning, this set of MDPs is equivalent to a set of bandit instances in which a unique arm has mean reward $0.5 + \epsilon - 0.5\rho$. As before, the only difference is that the regret is scaled up by a factor of $H - 1$.
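
As a sketch of what this shift does to the value of $a^\ast$, assume an undiscounted episode of $H$ steps whose first step is taken in $s_0$; the notation $Q_{\mathrm{rob}}$ is introduced here only for illustration. Then
\begin{align*}
    Q_{\mathrm{rob}}(s_0, a^\ast)
    = \ & (H - 1)(0.5 + \epsilon - 0.5\rho)\,,\\
    (H - 1)(0.5 + \epsilon) - Q_{\mathrm{rob}}(s_0, a^\ast)
    = \ & 0.5\,(H - 1)\,\rho\,,
\end{align*}
so, under this assumption, the robust value of $a^\ast$ falls below its nominal value by $0.5\,(H - 1)\,\rho$.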

For an algorithm $\gM$, let $\pi$ be the policy it learns with access to only the nominal model, and let $\pi^\prime$ be the policy it learns on the robust transition. Let $V^{\pi}(s_0)$ denote the robust value function of policy $\pi$, and let $V^{\pi^\prime}(s_0)$ be the robust value function of policy $\pi^\prime$. Only two kinds of policy are available for $\pi$ and $\pi^\prime$. The first chooses some $a \neq a^\ast$ in $s_0$; if both policies take non-special actions, then $V^{\pi}(s_0) = V^{\pi^\prime}(s_0)$, so that $V^{\pi^\prime}(s_0) - V^{\pi}(s_0) = 0$. The second chooses the special action $a^\ast$. In that case, we have $V^{\pi^\prime}(s_0) - V^{\pi}(s_0) = (H - 1) ( 0.5 +  \epsilon - 0.5 \rho )( \pi(a^\ast \mid s_0) - \pi^\prime(a^\ast \mid s_0)) \geq (H - 1) ( 0.5 \rho - \epsilon - 0.5)$.
%$V^{o, \pi}(s_0) - V^\pi(s_0) = 0.5 (H-1) \rho\cdot \pi(a^\ast \mid s_0) $. 
We let $V^\ast(s_0) = \max_\pi \min_{P \in \gP} V^{\pi, P}(s_0)$ be the optimal robust value under the robust transition kernel and $V^{o,\ast}(s_0) = \min_{P \in \gP} \max_\pi V^{\pi, P}(s_0)$. Note that we also have $V^\ast(s_0) - V^{o,\ast}(s_0) = 0$, as the worst-case transition kernel \ldots
For the set of robust MDPs, let us pick the special action $a^\ast$ uniformly at random among all actions. Recall that, for an algorithm $\gM$, $\pi$ is the policy it learns with access to only the nominal model and $\pi^\prime$ is the policy it learns on the robust transition. Then the regret is
\begin{align*}
     \text{Regret}^\pi(K) 
     = \ & V^\ast(s_0) - V^{\pi}(s_0)\\
     = \ & V^\ast(s_0) - V^{\pi^\prime}(s_0) + V^{\pi^\prime}(s_0) -  V^{\pi}(s_0)\\
     = \ & V^{o, \ast}(s_0) - V^{\pi^\prime}(s_0) + V^{\pi^\prime}(s_0) - V^\pi(s_0)\\
     \geq \ & (H - 1) ( 0.5 \rho - \epsilon - 0.5) + V^{o, \ast}(s_0) - V^{\pi^\prime}(s_0) \,.
\end{align*}
By the standard bandit lower bound \cite[Theorem A.2]{auer2002nonstochastic}, we have
\begin{align*}
    V^{o, \ast}(s_0) - V^{\pi^\prime}(s_0)  = \Omega \left( HK (\epsilon - \rho) \left( 1 - (\epsilon - \rho) \sqrt{\frac{K}{A}} \right) \right) \,.
\end{align*}
For $K \geq 4$, 
\begin{align*}
    \text{Regret}(K) 
    \geq \ & K(H - 1) ( 0.5 \rho - \epsilon - 0.5) +  \Omega \left( HK (\epsilon - \rho) \left( 1 - (\epsilon - \rho) \sqrt{\frac{K}{A}} \right) \right) \\
    \geq \ & K\left( \frac{1}{\sqrt{K}} + \epsilon - \rho\right) + \Omega \left( HK (\epsilon - \rho) \left( 1 - (\epsilon - \rho) \sqrt{\frac{K}{A}} \right) \right) \,.
\end{align*}
Taking $\epsilon = (1-\rho) \sqrt{\frac{A}{K}} + \rho$, we have
\begin{align*}
    \text{Regret}(K) 
    \geq \Omega \left( \rho(1- \rho) H \sqrt{AK} \right)\,.
\end{align*}
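As a sanity check on this choice of $\epsilon$, the $\Omega(\cdot)$ term can be evaluated directly. The sketch below tracks only that term and ignores constants. Since $\epsilon - \rho = (1 - \rho)\sqrt{A/K}$,
\begin{align*}
    (\epsilon - \rho) \sqrt{\frac{K}{A}}
    = \ & 1 - \rho\,,\\
    HK (\epsilon - \rho) \left( 1 - (\epsilon - \rho) \sqrt{\frac{K}{A}} \right)
    = \ & HK (1 - \rho) \sqrt{\frac{A}{K}} \cdot \rho
    = \rho (1 - \rho) H \sqrt{AK}\,.
\end{align*}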
%Notice that because $\rho \leq 1$, take $\epsilon = \rho \sqrt{\frac{A}{K}} + \rho$ we have
%\begin{align*}
%    \text{Regret}(K) = \max_{\pi} \text{Regret}^\pi(K) = \Omega(H ( 0.5 \rho - \epsilon ) + \rho H\sqrt{A K} )\,.
%\end{align*}

