
\textbf{Model-Free DR-RL Algorithm Design:}
Under the model-free setting, a general approach to designing a DR-RL algorithm is to adopt the Multi-Level Monte Carlo (MLMC) method \cite{blanchet2019unbiased} to estimate the dual value in the minimax problem. Specifically, the MLMC estimator requires $2^{N+1}$ samples to estimate the dual value, where $N\sim \text{Geo}(g)$, i.e., $\mathbb P(N=n):=g(1-g)^n$. However, the choice of $g$ involves a trade-off between the variance of the MLMC estimator and the expected total sample size.
% If $g\in(0,\frac{1}{2}]$
For a smaller $g$, e.g., $g\in(0,1/2)$ in \cite{liu2022distributionally}, the MLMC estimator draws enough samples to achieve a finite variance bound, but its expected sample size is infinite. For a larger $g$, i.e., $g\in (1/2,1)$, the MLMC estimator draws fewer samples so that the expected sample size is finite, but this leads to an unbounded variance or a restricted uncertainty radius. For instance, \cite{wang2023finite} provides a finite sample complexity analysis of the MLMC KL-constrained DR-RL algorithm, but the analysis relies on the assumption that the uncertainty set is sufficiently small, i.e., the radius is less than the minimum support (minimum non-zero probability) of the nominal distribution.
Thus, a natural idea to balance the trade-off in MLMC is to impose a threshold $N_{\max}$, i.e., $$N'=\min\varbrac{N, N_{\max}},\qquad N\sim \text{Geo}(g),$$ and then draw $2^{N'+1}$ samples. \cite{levy2020large} provides a similar MLMC approach for solving DRO problems. Our threshold MLMC method has a different formulation and can be adopted for different $g\in(0,1)$ instead of 
When $g=1/2$, the expected total sample size is of order $N_{\max}$. The threshold introduces a trade-off between the expected total sample size and the bias of the estimated dual value. When $N_{\max}\to \infty$, the MLMC estimator is an unbiased estimate of the dual value. In contrast, a smaller $N_{\max}$ reduces the expected total sample size but increases the bias of the estimated dual value.
Thus, by combining the threshold MLMC method with a feasible choice of $g$ and $N_{\max}$, we can design a model-free DR-RL algorithm that has both a convergence guarantee and finite sample complexity.
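
To make the sample-size side of this trade-off concrete, the expected number of samples per estimate can be computed directly from the truncated level $N'=\min\varbrac{N,N_{\max}}$. The calculation below is a sketch that assumes exactly $2^{N'+1}$ samples are drawn per estimate and uses $\mathbb P(N\geq N_{\max})=(1-g)^{N_{\max}}$:
\begin{align}
    \mathbb E\Fbrac{2^{N'+1}}
    &=\sum_{n=0}^{N_{\max}-1} g(1-g)^n\, 2^{n+1} + (1-g)^{N_{\max}}\, 2^{N_{\max}+1}.\nonumber
\end{align}
For $g=1/2$, every summand equals $1$ and the last term equals $2$, so the expected sample size is $N_{\max}+2$. It is finite for every finite $N_{\max}$ and grows linearly in $N_{\max}$, whereas without the threshold the same sum diverges.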

% \cite{liu2022distributionally} propose MLMC distributionally robust $Q$-learning algorithm with KL-constrained uncertainty set, which expected sample complexity is infinite. 
%   \cite{wang2023model} provides the  
% asymptotic convergence guarantee for model-free MLMC DR-RL algorithm.   \cite{wang2023finite} provides the finite sample complexity analysis of the Model-free KL-constrained DR-RL algorithm with a limitation uncertainty level, where the analysis is based on the assumption that uncertainty set is sufficiently small, i.e. radius is less than minimum support (minimum non-zero probability) of nominal distribution. 
%   Hence, there is a need for a model-free DR-RL algorithm and an associated sample complexity analysis that is adaptable to various constrained uncertainty sets and levels of uncertainty.

\textbf{Sample Complexity Analysis: }For DR-RL problems, sample complexity analysis is both important and challenging. Its key role is to determine the number of samples required to solve the minimax problem. In model-based methods, the solution of the dual problem is estimated via the empirical distribution constructed from a large number of samples. A commonly adopted approach \cite{shi2023curious,clavier2023towards} is to reduce the gap between the solutions under the nominal distribution and the empirical distribution to the distance between these two distributions, using the fact that
\begin{align}\label{eq:eq1}
    &\lbrac{\max_\alpha f(X, \alpha)-\max_\alpha f(X', \alpha)}\nonumber
    \\& \qquad\leq \max_{\alpha} \lbrac{f(X', \alpha)-f(X, \alpha) }.
\end{align}
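A short justification of \cref{eq:eq1}, given here as a sketch under the assumptions that $f$ is real-valued and that both maxima are attained: let $\alpha^\star\in\arg\max_\alpha f(X,\alpha)$. Since $\max_\alpha f(X',\alpha)\geq f(X',\alpha^\star)$,
\begin{align}
    \max_\alpha f(X,\alpha)-\max_\alpha f(X',\alpha)
    &\leq f(X,\alpha^\star)-f(X',\alpha^\star)\nonumber
    \\&\leq \max_\alpha \left| f(X,\alpha)-f(X',\alpha)\right|,\nonumber
\end{align}
and exchanging the roles of $X$ and $X'$ gives the same bound for the reverse difference.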
% Although this approach can provide the upper and lower bound in the setting of $TV$-constrained and $\chi^2$-constrained uncertainty set, for the KL-constrained uncertainty set, this approach will provide a loose bound which leads to the $\mathcal{O}(\exp(1-\gamma))$ term in the total sample size, where $\gamma$ is the discount factor in MDPs. Furthermore, 
However, the analysis based on \cref{eq:eq1} focuses on bounding the gap between the nominal distribution and the empirical distribution but ignores the statistical properties of the minimax problem, such as the detailed worst-case distribution over the $\rho$-constrained uncertainty set. These statistical properties are important for further DR-RL research, e.g., tighter bounds in DR-RL sample complexity analysis and computationally efficient algorithm design in practice.
Besides the analysis based on \cref{eq:eq1}, \cite{levy2020large} provides a different method for analyzing the sample complexity of DRO problems, which is applied to CVaR and $\chi^2$-constrained uncertainty sets and is hard to extend to other uncertainty sets (e.g., TV- or KL-constrained uncertainty sets). Therefore, a study of the statistical properties of the solution to DRO problems is required. Moreover, an analytical approach that exploits the properties of solutions to DRO problems is valuable and challenging in DR-RL.
% Furthermore, the analysis approach based on properties of the solution to DRO problems  is valuable and challenging in DR-RL problem.  
% Another approach to analyzing the sample size is to study the favorable statistical properties of the Lagrange function with the uncertainty set centered in the empirical distribution. Based on this method, \cite{levy2020large} provides the sample complexity to solve the minimax problems, which can easily extend to DR-RL. However, this approach . For , the theoretical result is open and hard to extend from the method in \cite{levy2020large}. 


% Thus, the sample complexity analysis of model-free DR-RL problems is still open.     

% Thus, 

% The sample analysis for some work \cite{liu2022distributionally} adopted model-free approach is based on the properties of 

% with expected infinite sample comple 


\begin{definition}[Biased estimation]\label{def:3.1}
    Draw $n$ samples from the nominal distribution, $x_i\sim p$, $i=0,1,\ldots, n-1$, and obtain the empirical frequency $\hat p_n$ (resp. $\hat \mu_n$). We define
    \begin{align}
        f^{*\rho(\vartheta)}(\hat p_n,V(s'_{s,a})):=\sup_{\alpha\geq 0} \varbrac{f^{\rho(\vartheta)}(\hat p_n, \alpha, V(s'_{s,a})) },\nonumber
    \end{align} and 
    \begin{align}
        \boldsymbol{f^*}^{\rho(\vartheta)}\brac{\hat p_n, V(s'_{s,a})}:=\mathbb E_{\hat{p}_n}\Fbrac{f^{*\rho(\vartheta)}(\hat p_n,V(s'_{s,a}))}.\nonumber
    \end{align}
\end{definition}



\begin{proposition}[Threshold MLMC]\label{prop:estimate}
The robust Bellman operator satisfies
\begin{align}
    \mathbb E \Fbrac{r_{s,a,0}+ \frac{\delta^{r,\rho(\vartheta)}_{s,a,N_1}}{P_{N_1}} }=& \boldsymbol{f^*}^{\rho(\vartheta)}\brac{\widehat \mu_{s,a,2^{N_{\max}+1}},\boldsymbol{id}(r_{s,a})  },\nonumber
\end{align}
and 
\begin{align}
    \mathbb E &\Fbrac{V(s'_{s,a,0})+\frac{\delta^{\rho(\vartheta)}_{s,a,N_2}(Q) }{P_{N_2}} }\nonumber\\&\qquad\qquad=\boldsymbol{f^*}^{\rho(\vartheta)}\brac{\widehat p_{s,a,2^{N_{\max}+1}},V(s'_{s,a})  }.\nonumber
\end{align}
    % Sample $N$ from a geometric distribution $\text{Geo}(g)$,i.e. $\mathbb P(N=n)=p_n:= g(1-g)^n, n=0,1,...$ with threshold $N=\min\varbrac{N,N_{\max}}$. Then draw $2^{N+1}+1$ samples from nominal distribution $x_i\sim p, i=0,1,...,2^{N+1}$, 
\end{proposition}





Under the model-free setting, the estimate of the robust Bellman operator is biased, and the bias depends on the sample size of the empirical distribution.
We prove that, when the threshold $N_{\max}$ is applied in our algorithm, the bias of the robust Bellman operator estimate equals the bias of the model-based algorithm with sample size $2^{N_{\max}+1}$.
We state this result in the following proposition.
\begin{proposition}[Threshold MLMC]\label{prop:mlmc}
The robust Bellman estimator $\widehat{v}^{\rho(\sigma)}(Q(s,a))$ (resp. $\widehat r^{\rho(\sigma)}(s,a) $) satisfies
\begin{align}
    \E\Fbrac{\widehat{v}^{\rho(\sigma)}(Q(s,a)) }&=\mathbb E \Fbrac{V(s'_{s,a,0})+\frac{\delta^{\rho(\sigma)}_{s,a,N_2}(Q) }{P_{N_2}} }\nonumber\\&=\E\Fbrac{f^{*\rho(\sigma)}(\hat p_{s,a,2^{N_{\max}+1}},V(s'_{s,a}))}
    % \boldsymbol{\bar f^*}_{2^{N_{\max}+1}}^{\rho(\sigma)}\brac{p_{s,a},V(s'_{s,a})  }
    .\nonumber
\end{align}
\end{proposition}
\cref{prop:mlmc} shows that the estimation bias is the same whether we draw $2^{N_{\max}+1}$ samples to estimate the dual value directly or use the $N_{\max}$-threshold MLMC algorithm to estimate the dual value.
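
A sketch of why the threshold yields exactly this expectation, under assumptions that this section does not state explicitly: $P_n:=\mathbb P(N_2=n)>0$ for $n=0,\ldots,N_{\max}$ is the probability mass function of the truncated level, which is drawn independently of the samples; the level-$n$ correction has the usual MLMC mean $\mathbb E\Fbrac{\delta^{\rho(\sigma)}_{s,a,n}(Q)}=F_{n+1}-F_n$, where $F_k:=\mathbb E\Fbrac{f^{*\rho(\sigma)}(\hat p_{s,a,2^{k}},V(s'_{s,a}))}$ is shorthand introduced only for this sketch; and the single-sample term satisfies $\mathbb E\Fbrac{V(s'_{s,a,0})}=F_0$. Then the weights $1/P_n$ cancel the level probabilities and the sum telescopes:
\begin{align}
    \mathbb E\Fbrac{\frac{\delta^{\rho(\sigma)}_{s,a,N_2}(Q)}{P_{N_2}}}
    &=\sum_{n=0}^{N_{\max}} P_n\,\frac{\mathbb E\Fbrac{\delta^{\rho(\sigma)}_{s,a,n}(Q)}}{P_n}
    =\sum_{n=0}^{N_{\max}}\brac{F_{n+1}-F_n}\nonumber
    \\&=F_{N_{\max}+1}-F_0.\nonumber
\end{align}
Adding $\mathbb E\Fbrac{V(s'_{s,a,0})}=F_0$ leaves $F_{N_{\max}+1}$, the expected dual value computed from $2^{N_{\max}+1}$ samples.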

% When sampling $n$
Based on \cref{prop:mlmc}, for the distance $\rho$ and uncertainty level $\sigma$, we define the surrogate $Q$-table $ \widehat Q^{*\rho(\sigma)}$ and the robust optimal $Q$-table $Q^{*\rho(\sigma)}$ as the fixed points satisfying
% estimation of estimated robust Bellman operator and robust Bellman operator following
\begin{align}
    \E\Fbrac{ {\widehat {\mathcal{T}}}_{N_{\max}}^{\rho(\sigma)} (\widehat Q^{*\rho(\sigma)})(s,a) }&= \widehat Q^{*\rho(\sigma)}(s,a),
    \nonumber
% \\& =\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{\hat \mu_{s,a,2^{N_{\max}+1}},\boldsymbol{id}(r_{s,a})  }\nonumber\\& \quad+\gamma\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{ \hat p_{s,a,2^{N_{\max}+1}}, V(s'_{s,a})},\nonumber
\\ { {\mathcal{T}}}^{\rho(\sigma)} (Q^{*\rho(\sigma)})(s,a)&
=Q^{*\rho(\sigma)}(s,a)
  % =\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{ \mu_{s,a},\boldsymbol{id}(r_{s,a})  }\nonumber\\& \quad+\gamma\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{  p_{s,a}, V(s'_{s,a})}
  .\label{eq:fixp}
\end{align}

We decompose the error through the surrogate $Q$-table $ \widehat Q^{*\rho(\sigma)}$:
\begin{align}\label{eqeq20}
    {\mynorm{\widehat Q_T^{\rho(\sigma)}-Q^{*\rho(\sigma)}}_\infty^2}
    &\leq 2 {\mynorm{\widehat Q_T^{\rho(\sigma)}-\widehat Q^{*\rho(\sigma)}}_\infty^2 }\nonumber\\
    &\quad+ 2{\mynorm{\widehat Q^{*\rho(\sigma)}-Q^{*\rho(\sigma)}}_\infty^2 }.
\end{align}
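
For completeness, \cref{eqeq20} follows from two elementary steps. The triangle inequality gives
\begin{align}
    \mynorm{\widehat Q_T^{\rho(\sigma)}-Q^{*\rho(\sigma)}}_\infty
    \leq \mynorm{\widehat Q_T^{\rho(\sigma)}-\widehat Q^{*\rho(\sigma)}}_\infty
    + \mynorm{\widehat Q^{*\rho(\sigma)}-Q^{*\rho(\sigma)}}_\infty,\nonumber
\end{align}
and squaring both sides together with $(a+b)^2\leq 2a^2+2b^2$, which holds because $2ab\leq a^2+b^2$, yields the stated bound.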

\textbf{Second term in \cref{eqeq20}: }
The second term in \cref{eqeq20} is the gap between the fixed points in \cref{eq:fixp},
% Given $Q(s,a)$, the gap between the robust Bellman operator $\boldsymbol{\widehat {\mathcal{T}}}^{\rho(\sigma)} (Q)(s,a)$ and estimated robust Bellman operator $\boldsymbol{ {\mathcal{T}}}^{\rho(\sigma)} (Q)(s,a)$ 
which can be bounded following the methods in model-based works \cite{shi2023curious,yang2022toward} combined with \cref{prop:mlmc}.
% The surrogate biased estimated optimal $Q$-table  $ \widehat Q^{*\rho(\sigma)}$ satisfies the following equation:
% \eqenv{\label{eqeq23}
% \widehat Q^{*\rho(\sigma)}(s,a)=  \boldsymbol{\widehat {\mathcal{T}}}^{\rho(\sigma)} (\widehat Q^{*\rho(\sigma)})(s,a).
% }
% Combined with the robust Bellman equation, 
% \begin{align}
%       Q^{*\rho(\sigma)}(s,a)=  \boldsymbol{ {\mathcal{T}}}^{\rho(\sigma)} (  Q^{*\rho(\sigma)})(s,a),
% \end{align}
% we can make error decomposition as following 
% % $ {\mynorm{\widehat Q_T^{\rho(\sigma)}-\widehat Q^{*\rho(\sigma)}}_\infty^2 }$.
% \eqenv{
% &{\mynorm{\widehat Q^{*\rho(\sigma)}-Q^{*\rho(\sigma)}}_\infty^2 }\\& \leq
% \norminf{\hatT\brac{\widehat Q^{*\rho(\sigma)}}-\boldsymbol{\widehat{\mathcal{T}}}^{\rho(\sigma)}\brac{Q^{*\rho(\sigma)}} }
% \\& \quad+ \norminf{\hatT\brac{ Q^{*\rho(\sigma)}}-\mathcal{T}^{\rho(\sigma)}\brac{Q^{*\rho(\sigma)}}},
% } 


\textbf{First term in \cref{eqeq20}: Variance Bound}
Next, we present the bound on ${\mynorm{\widehat Q_T^{\rho(\sigma)}-\widehat Q^{*\rho(\sigma)}}_\infty } $. The error between the iterate and the surrogate $Q$-table, $ {\mynorm{ \widehat Q_T^{\rho(\sigma)}-\widehat Q^{*\rho(\sigma)}}_\infty^2 }$, can be bounded using two facts: 1) $\E\Fbrac{\widehat {\mathcal{T}}_{N_{\max}}^{\rho(\sigma)}(Q)}$ is a $\gamma$-contraction operator, and 2) the estimate of the robust Bellman operator can be bounded via \cref{prop:mlmc}. In addition, a bound on the variance of the MLMC estimator $\widehat {\mathcal{T}}_{N_{\max}}^{\rho(\sigma)}(Q)(s,a)$ is required to ensure the convergence of the algorithm. Taking the expectation over $N\sim \text{Geo}(1/2)$, the variance can be bounded by
$\mathcal{O}\brac{\sum_{N_1=0}^{N_{\max}} \frac{\brac{\delta^{r,\rho(\sigma)}_{s,a,N_1}
  }^2}{P_{N_1}}+\sum_{N_2=0}^{N_{\max}}\frac{\brac{
\delta^{\rho(\sigma)}_{s,a,N_2}(Q)  }^2}{P_{N_2}}}$. 
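
The form of this bound can be seen from a short second-moment calculation, sketched here under the assumption that $P_n:=\mathbb P(N_2=n)>0$ for $n=0,\ldots,N_{\max}$ is the probability mass function of the truncated level and that the level is drawn independently of the samples:
\begin{align}
    \mathbb E\Fbrac{\brac{\frac{\delta^{\rho(\sigma)}_{s,a,N_2}(Q)}{P_{N_2}}}^2}
    =\sum_{n=0}^{N_{\max}} P_n\,\frac{\mathbb E\Fbrac{\brac{\delta^{\rho(\sigma)}_{s,a,n}(Q)}^2}}{P_n^2}
    =\sum_{n=0}^{N_{\max}} \frac{\mathbb E\Fbrac{\brac{\delta^{\rho(\sigma)}_{s,a,n}(Q)}^2}}{P_n}.\nonumber
\end{align}
Since the variance is at most the second moment, it suffices to control each squared correction relative to $P_n$; the reward term is handled in the same way. As a purely hypothetical illustration, if $\mathbb E\Fbrac{\brac{\delta^{\rho(\sigma)}_{s,a,n}(Q)}^2}\leq C\,2^{-n}$ for some constant $C$, then with $g=1/2$ (so that $P_n=2^{-(n+1)}$ for $n<N_{\max}$ and $P_{N_{\max}}=2^{-N_{\max}}$) the sum is at most $2CN_{\max}+C$.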

Next, we decompose the term $\brac{\delta^{\rho(\sigma)}_{s,a,N_2}(Q)  }^2$:
\eqenv{
&\lbrac{\delta^{\rho(\sigma)}_{s,a,N}(Q)(s,a) }^2
    \\& \leq 3\lbrac{\sup_{\alpha\geq 0}\varbrac{f^{\rho(\sigma)}(\widehat p_{s,a,2^{N+1}},\alpha,V)}-\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{\hat p_n, V(s'_{s,a})}  }^2
    \\&\quad+  \frac{3}{2}\lbrac{ \sup_{\alpha\geq 0}\varbrac{f^{\rho(\sigma)}(\widehat p^E_{2^{N}},\alpha,V)}-\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{\hat p_n, V(s'_{s,a})}}^2
    \\&\quad+\frac{3}{2}\lbrac{\sup_{\alpha\geq 0}\varbrac{f^{\rho(\sigma)}(\widehat p^O_{2^{N}},\alpha,V )}-\boldsymbol{\bar f^*}^{\rho(\sigma)}\brac{\hat p_n, V(s'_{s,a})}}^2.\label{eqeq24}
    % \\& \myineq{}{}
}

The terms in \cref{eqeq24} can then be bounded in a way similar to the model-based approach \cite{shi2023curious,yang2022toward}.
% Making the decomposition and we get that
% \eqenv{
% \E&\Fbrac{\norminf{\widehat Q_{t+1}^{\rho_{TV}(\sigma)}-\widehat Q^{*\rho(\sigma)} }^2}
% \\&\qquad\leq\brac{1-2(1-\gamma)\alpha}\E\Fbrac{\norminf{\widehat Q_{t}^{\rho_{TV}(\sigma)}-\widehat Q^{*\rho(\sigma)}   }^2} 
% \\&\qquad\quad+ \alpha^2 \max_{s,a}\Varr{\widehat {\mathcal{T}}^{\rho(\sigma)} (Q)(s,a)}.
% }
% We analyze the term  $\E\Fbrac{\twonormsq{\widehat {\mathcal{T}}^{\rho(\sigma)} (Q)(s,a)}} $ to bound the variance $\widehat {\mathcal{T}}^{\rho(\sigma)}(Q)(s,a)$.
% Take the expectation of $N\sim \text{Geo}(1/2)$, we get 

Thus, the variance of $\widehat {\mathcal{T}}^{\rho(\sigma)}(Q)(s,a)$ can be bounded by a constant. Combined with the fact that $\E\Fbrac{\widehat {\mathcal{T}}^{\rho(\sigma)}(Q)}$ is a $\gamma$-contraction, we obtain the bound on
$ {\mynorm{ \widehat Q_T^{\rho(\sigma)}-\widehat Q^{*\rho(\sigma)}}_\infty^2 }$.