\documentclass{article}
\usepackage{amsmath,amssymb}

\usepackage{color}
\newcommand{\zou}[1]{{\color{blue}{Zou: #1}}}
\begin{document}

For numerical result:

To reviewer: 1
Dear Reviewer,

We have updated the results. In the link https://anonymous.4open.science/r/UAI_278_results_-B68A/ , you will find the results for the FrozenLake environment, Gambler's game, and the recycling robot. Here, the baseline label in the figures represents the robust optimal policy when the transition kernel is applied directly.

In these figures, the y-axis represents the expected cumulative reward $J^\pi= \mathbb{E}[V^\pi_{\mathbf{P}_0,\mathbf{R}_0}(s)]$. 

For the results where T-MLMC is compared with the model-based method and vanilla MLMC, the x-axis represents the number of samples. We let the model-based method draw $2^{N_{\max}+1}$ samples per step, so that both T-MLMC and the model-based method converge to the same $\epsilon$-accurate optimal policy. These results show that our T-MLMC algorithm is computationally economical, which is also an advantage of model-free methods.


Furthermore, additional simulation results demonstrate the convergence of T-MLMC. With a suitably chosen $N_{\max}$, the bias introduced by the T-MLMC estimator does not destabilize the algorithm, which converges to the $\epsilon$-accurate optimal robust policy.

To reviewer 5cr9: 
\textbf{***Weaknesses 1.a and b: the even- and odd-indexed samples in MLMC is just the standard approach, why this would work on a high level;  the "correction terms" seem under-motivated***}
We appreciate your suggestions and will update the writing accordingly. Here we provide an intuitive explanation of the MLMC algorithm. The idea is that the maximum likelihood estimate $\hat{p}_N$ built from $N$ samples becomes accurate as $N\rightarrow\infty$: $\hat{p}_\infty=p$, and thus $f(p)=f(\hat{p}_\infty)$. This allows us to rewrite the function as a telescoping sum:
\[
\begin{aligned}
f(p)=f(\hat{p}_\infty)
&=f(\hat{p}_{2^0})+\sum_{N=0}^\infty \left[f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^{N}})\right]\\
&=f(\hat{p}_{2^0})+\sum_{N=0}^\infty P_N \frac{f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^{N}})}{P_N}\\
&= f(\hat{p}_{2^0})+\mathbb{E}_{N\sim P_N}\left[\frac{f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^N})}{P_N}\right]
\end{aligned}
\]
for any distribution $P_N$ with $P_N>0$ for all $N\geq 0$. It hence suffices to obtain an unbiased estimator of $\mathbb{E}_{N\sim P_N}\left[\frac{f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^N})}{P_N}\right]$. To this end, we sample $N\sim P_N$ and construct the ``correction term'' $\delta^{r, \rho(\sigma)}_{s,a,N}=\frac{f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^{N}})}{P_{N}}$, which is an unbiased estimator of this expectation.
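
As a concrete instance of the identity above, here is a short sketch under an assumed level distribution $P_N=2^{-(N+1)}$ for $N=0,1,\dots$ (a geometric distribution with parameter $0.5$ started at zero; the exact parameterization used in the paper may differ). The correction term then reads $2^{N+1}\left[f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^{N}})\right]$, and averaging over $N$ cancels the weights and recovers the telescoping sum, in the same heuristic sense as above:
\[
\sum_{N=0}^\infty 2^{-(N+1)}\cdot 2^{N+1}\left[f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^{N}})\right]
=\lim_{M\to\infty} f(\hat{p}_{2^{M+1}})-f(\hat{p}_{2^0})
=f(p)-f(\hat{p}_{2^0}).
\]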


%, where the term $f(\hat{p}_{2^{N_1}})$ is estimated by $ \frac{f(\hat{p}^O_{2^{N_1}})+f(\hat{p}^E_{2^{N_1}})}{2}$.


The partition of the sample indices is not essential. For any integer $k\geq 2$ we have that $f(\hat{p}_\infty)= f(\hat{p}_{k^0})+\mathbb{E}_{N\sim P_N}\left[\frac{f(\hat{p}_{k^{N+1}})- f(\hat{p}_{k^N})}{P_N}\right]$, and the induced MLMC estimator is also unbiased. We hence follow the standard formulation of MLMC and choose $k=2$.


%However, this change will increase the variance of the MLMC method. Thus, the choice $k=2$ is widely accepted in MLMC. 

% There are no extra requirements for index splitting. Any other splitting that can divide the data set $\hat{p}^{2^{N_1+1}}$ into two different data sets can be accepted. 
 



\textbf{***Weaknesses 1.c: Is adding the $N_{\max}$ threshold the only change made by this paper? A few sentences comparing the current design against previous designs would be helpful.***}

In terms of algorithm design, the key modification compared to previous MLMC algorithms is the inclusion of a threshold, but it leads to substantial new developments in the theoretical analysis. Previous MLMC algorithms either require an infinite number of samples [Liu et al., 2022] or rely on a strong assumption ($\frac{p_\wedge}{2}\geq 1-e^{-\sigma}$) [Wang et al., 2023a, b] to ensure that the MLMC estimator is unbiased with bounded variance.
Our threshold-based design circumvents these issues, albeit at the expense of introducing bias into the estimator. We show that the event on which T-MLMC and MLMC differ occurs with exponentially small probability, and we carefully analyze the deviation introduced by this bias. We show that with a carefully designed threshold, the deviation is small and our biased algorithm converges to a close neighborhood of the optimal policy, requiring fewer samples and no additional assumptions. This underscores the effectiveness of our design.
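
The origin of the bias can be sketched as follows, using the notation $f$, $\hat{p}_{2^N}$ and $P_N$ from our response to Weaknesses 1.a and b. This is only an illustration under an assumption: we suppose that the correction term is used only for levels $N\leq N_{\max}$ (for instance, by discarding larger levels), which is a simplified reading of the threshold and not a restatement of the exact algorithm. The telescoping sum then stops at level $N_{\max}$:
\[
f(\hat{p}_{2^0})+\sum_{N=0}^{N_{\max}} P_N\,\frac{f(\hat{p}_{2^{N+1}})- f(\hat{p}_{2^{N}})}{P_N}
= f(\hat{p}_{2^{N_{\max}+1}}).
\]
Under this assumption the estimator targets $f(\hat{p}_{2^{N_{\max}+1}})$ rather than $f(p)=f(\hat{p}_\infty)$, so the bias is the gap between the two, which shrinks as $N_{\max}$ grows.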


%Since the existence of the threshold and lack of assumption, our work design a new approach to analyze the sample complexity. 


%To bound the bias due to threshold, we design a surrogate Q table to analyze the convergence of algorithm. Without the assumption constraining the uncertainty level, we bound the variance of T-MLMC estimator based on the statistical property. 
%\zou{provide details about the assumption difference between previous work and our paper}
%\zou{explain what is new in order to analyze the bias}

 

 %ur algorithm provides analysis of the model-free threshold MLMC algorithm, which is a different method from existing works. We firstly provide the model-free MLMC sample complexity for TV and $\chi^2$ uncertainty set.  Moreover, our analysis method can avoid the extra assumptions in MLMC.  

\textbf{***Weaknesses 1.d: an introductory paragraph stating the high-level intuition behind MLMC would also be very helpful***}

Thanks for your comments. We will add the discussion from our response to Weaknesses 1.a and b, together with an introductory paragraph, to the revised version.


\textbf{***Weaknesses 2. a: Storing the estimated transition kernel indeed requires $S^2A$ space, but storing the Q-function also requires $SA$ space, which is not significantly better than the former when $S$ is infinitely large. ***} 
One major advantage of model-free methods is that they eliminate the need to explicitly estimate the empirical transition kernel and can be implemented in an online fashion, which saves a factor of $S$ in the memory cost.
In many practical applications, the state space, though discrete, can still be extremely large, and such a reduction yields a great improvement in memory efficiency. Quite a few works have been dedicated to finding the optimal dependence of the complexity on $S$, as listed in [a,b] for the online learning setting.

%. For example, in the case of vanilla $Q$-learning sample complexity analysis, the table in [a] indicates that the improvement in the order of $S$ is substantial rather than incremental.

%For the offline setting, the table in [b] lists related works, highlighting that the improvement in the order of $S$ is a crucial aspect.




\zou{in our discussion, we said that we should add some references on standard mdp that eliminating $S$ is a big thing. It would be great if we can point to a table in some paper that some paper improve the order by a factor of $S$}

[a] Li, Gen, et al. "Is Q-learning minimax optimal? a tight sample complexity analysis." Operations Research 72.1 (2024): 222-236.

[b] Zhang, Zihan, et al. "Settling the sample complexity of online reinforcement learning." arXiv preprint arXiv:2307.13586 (2023).

\textbf{***Weaknesses 2. b and d: whether [Panaganti $\&$ Kalathil, 2022] is model-based or model-free? leveraging the intrinsic low-rankness of MDPs and introducing some kind of representation***}

% Here we notice there are two different work [Panaganti $\&$ Kalathil, 2022a] and [Panaganti $\&$ Kalathil, 2022b]. In this paper, we cite the work [Panaganti $\&$ Kalathil, 2022a] in tables, which is a model-based work. [Panaganti $\&$ Kalathil, 2022b] provides an offline QVI (robust fitted Q-iteration) algorithm, which is specifically for TV uncertainty set. Since there are some differences between offline robust RL and online robust RL, we have not compared this work with our work. 

% We believe that leveraging the intrinsic low-rankness of MDPs and introducing some kind of representation, avoiding the $\mathcal{O}(SA)$ order in the final sample complexity can be regarded as a question:

% \textit{Whether our robust tabular MLMC algorithm can extend to the function approximation setting to reduce the complexity? } 


We note that there are two different works, [Panaganti $\&$ Kalathil, 2022a] and [Panaganti $\&$ Kalathil, 2022b]. In this paper, we cite [Panaganti $\&$ Kalathil, 2022a] in the tables, which is a model-based approach for **tabular** robust MDPs. [Panaganti $\&$ Kalathil, 2022b] studies robust RL with **linear function approximation**. Since we focus on the tabular setting in this paper, we did not discuss [Panaganti $\&$ Kalathil, 2022b] in the tables.


%, , which is specifically for TV uncertainty set. Since there are some differences between offline robust RL and online robust RL, we have not compared this work with our work. 


We agree with the reviewer that additional techniques such as function approximation or low-dimensional structure are required when tackling large-scale problems. In this paper, we start with the fundamental tabular setting and aim to understand, even in this setting, how to design algorithms and analyze their complexity. Extending our approach to large-scale problems using function approximation or low-dimensional structure is also of future interest to us.

% However, we believe our biased stochastic approximation methodology can be used when designing algorithms. Namely, when it is challenge to obtain an unbiased estimator, we can instead adopt a biased one and characterize the resulting algorithms. As long as the bias is controllable, we can similarly balance it and the estimation cost to design efficient algorithms. We leave this problem as a future research direction. 

%We believe that leveraging the intrinsic low-rankness of MDPs and introducing a representation to avoid the $\mathcal{O}(SA)$ order in final sample complexity poses an important question:

%\textit{Can our robust tabular MLMC algorithm be extended to the function approximation setting to reduce complexity?}

%This question is significant for robust RL problems. Undoubtedly, our threshold MLMC can be extended to a function approximation setting. However, a key distinction between model-free and model-based works, as shown in Tables 1-3, is that model-based works require $\mathcal{\epsilon^{-2}}$ order samples to estimate the model and obtain the dual value; in contrast, the estimated samples of threshold model-free works is $\mathcal{O}(N_{\max}) = \mathcal{O}(\log(\epsilon^{-1}))$ for estimating the dual value. When extending model-based methods to the function approximation setting, we observe that in [Panaganti \& Kalathil, 2022b], obtaining $g_k$ that minimizes eq. (8) incurs significant computational and memory costs, as $N = \mathcal{O}(\epsilon^{-2})$. However, extending our MLMC model-free method to the function approximation setting is expected to greatly reduce computational and memory costs.

%Integrating the model-free MLMC with function approximation presents a meaningful and valuable inquiry in the field of robust RL. However, such integration should be built upon the foundation laid by this paper, rather than being its central focus. We believe this area merits further investigation based on our current research, and we intend to concentrate on this question in future work.

% Integrating the model-free MLMC with function approximation poses a meaningful and valuable question in robust RL. However, these works need to be based on this paper and not the key of this paper. We believe this warrants further investigation based on our current research, and we plan to focus on this question in the future.

% It is a significant question for robust RL problems. It is no double that our threshold MLMC can extend to function approximation setting. However, the main difference between model-free and model-based works in Table 1-3 is that model-based works require $\mathcal{\epsilon^{-2}}$ order samples to estimate the model and get the dual value; the threshold model-free works require $\mathcal{O}(N_{\max})=\mathcal{O}(\log(\epsilon^{-1}))$ samples to estimate the dual value. When extending Model-based methods to the function approximation set, we notice that in [Panaganti $\&$ Kalathil, 2022b], to get the $g_k$ which minimizes the eq. (8), the computational and memory costs are expensive since $N=\mathcal{O}(\epsilon^{-2})$. However, when extending our MLMC model-free method to the function approximation set, it is foreseeable that the computational and memory costs will be saved greatly. 

% Combining the model-free MLMC with function approximation is a meaningful and valuable question in robust RL.  We believe these should be further work based on our current research and we will force on this question in the future. 


Panaganti, et al. (2022a). Sample complexity of robust reinforcement learning with a generative model. 

Panaganti, et al. (2022b). Robust reinforcement learning using offline data. 



\textbf{***Weaknesses 2.c: I appreciate the fact that Tables 1 through 3 honestly list the complexities of model-based algorithms, some of which turn out to be superior to the proposed model-free algorithm. That said, the paper largely avoids talking about the better complexities achieved by model-based algorithms.***} 

First, we want to clarify that we do discuss and acknowledge the better complexity of model-based approaches at the end of Section 4. It is common for model-based approaches to be more sample-efficient than most vanilla model-free ones, and only a few model-free approaches with variance reduction techniques achieve the same complexity as model-based ones.

However, the major benefit that model-free approaches offer is memory/space efficiency. This makes model-free approaches more appealing for scenarios where computational resources are limited or when working with large-scale problems. Without any need to store the whole transition model, the memory requirement of model-free approaches is reduced from $\mathcal{O}(S^2A)$ to $\mathcal{O}(SA)$.

On the other hand, with variance reduction methods, our T-MLMC is expected to achieve the same sample complexity as model-based methods. We leave this for future work.

%However, this is not the center of our work. We believe applying the variance reduction method to our T-MLMC is meaningful and the future work based on our work. 
% We firstly provide the sample complexity analysis of model-free robust Q-learning in TV and $\chi^2$ uncertaity
% We agree that model-based methods have better sample complexity than our model-based methods. In general, model-free methods are computational and memory economic.


\zou{i do not agree that model based methods usually achieve better complexity}
% We acknowledge that model-based methods typically achieve better sample complexities compared to our model-free methods.
% Generally, model-free methods offer advantages in terms of computational and memory efficiency. These efficiencies make the model-free setting more appealing for scenarios where computational resources are limited or when working with large-scale problems.




To reviewer XEXK:
\textbf{***Weaknesses: Results improve marginally comparing with existing methods of Wang et al., 2023 ab in terms of improved constant dependences.***}

We first highlight that, compared to [Wang et al., 2023a, b], we additionally provide results for two more uncertainty sets: total variation and chi-square. The analyses for these sets are substantially different from the one for KL divergence in [Wang et al., 2023a, b].

Compared with previous works on the KL uncertainty set, besides the sample complexity, our major improvement is that we get rid of the restrictive assumption they made, which is \[\frac{1}{2}p_\wedge \geq 1- e^{-\sigma},\] where $p_\wedge$ is the minimum positive entry of the nominal transition kernel.
We can easily construct an uncertainty set for which this assumption does not hold. This condition therefore restricts the applicability of their approaches, whereas our results hold without any such assumption.
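
As a quick illustration of how restrictive this condition is, here is a worked calculation based only on the inequality above, where $\log$ denotes the natural logarithm and the value $p_\wedge=0.1$ is a hypothetical example rather than a quantity from any experiment. Rearranging the condition gives
\[
1-e^{-\sigma}\leq \tfrac{1}{2}p_\wedge
\quad\Longleftrightarrow\quad
\sigma\leq -\log\left(1-\tfrac{1}{2}p_\wedge\right).
\]
Since $p_\wedge\leq 1$, the radius can never exceed $\log 2\approx 0.69$, so any uncertainty set with $\sigma>\log 2$ violates the assumption regardless of the nominal kernel. For the hypothetical value $p_\wedge=0.1$, the admissible radius is already as small as $\sigma\leq -\log(0.95)\approx 0.051$.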



%From the fact that $1-e^{-\sigma}\geq \frac{\sigma}{2}$ when $1-e^{-\sigma}\leq \frac{1}{2}$, we can get that  $\frac{1}{2}p_\wedge \geq 1- e^{-\sigma}\geq \frac{\sigma}{2}$. Therefore, the assumption requires that the uncertainty set radius $\sigma$ satisfies that $ \sigma\leq p_\wedge$. Therefore, the uncertainty level will be constrained by $p_\wedge$. In general, the $p_\wedge$ is hard to access and small. 

%However, our T-MLMC algorithm is not based on this assumption. 

\textbf{***Questions 1: As noted in Sec. 3, the T-MLMC estimator also requires infinitely many samples (as $N_{\max}\to \infty$) to reduce biases. What is the intuition of its improvement over the original MLMC estimator (which requires infinite expected amount of samples as well)?***}

In our T-MLMC design, we do not require $N_{\max}\to \infty$. Instead, we set it to a fixed number of order $N_{\max}=\mathcal{O}(\log(1/\epsilon))$, where $\epsilon$ is the desired accuracy. With our design, the number of samples required at each step is at most $\mathcal{O}(N_{\max} \cdot 2^{\frac{N_{\max}}{2}})$ and is hence finite.

On the other hand, the expected number of samples required for vanilla MLMC in [Liu et al., 2022] is $\mathbb{E}_{N\sim \mathrm{Geom}(0.5)}[N\cdot 2^N]=\infty$.
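
To see the divergence concretely, here is a short calculation, assuming that $\mathrm{Geom}(0.5)$ denotes the level distribution with $\mathbb{P}(N=n)=2^{-(n+1)}$ for $n=0,1,\dots$ (the exact parameterization in [Liu et al., 2022] may differ, but shifting the support does not change the conclusion):
\[
\mathbb{E}\left[N\cdot 2^N\right]
=\sum_{n=0}^\infty 2^{-(n+1)}\, n\, 2^n
=\frac{1}{2}\sum_{n=0}^\infty n
=\infty.
\]
The per-level cost grows as fast as the level probability decays, so the series cannot converge, whereas capping the level at $N_{\max}$ leaves only finitely many terms.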

The intuition behind this improvement is that the reduction in sample complexity comes at the price of bias. Namely, when the threshold is used, the resulting estimator is no longer unbiased. But we can show that even with the bias, our algorithm still finds an $\epsilon$-accurate optimal policy with a finite sample complexity.



%The T-MLMC estimator can unbiasedly estimate the dual function value when $N_{\max} \to \infty$. However, within the context of our work, the number of estimated samples is $\mathcal{O}(N_{\max})$. When $N_{\max} \to \infty$ (which corresponds to the scenario with the original MLMC estimator), the estimator is unbiased, but requires an infinite number of samples.

%Generally, the sample complexity analysis demands an estimator with a bounded bias rather than a strictly unbiased estimator. Therefore, we explore the trade-off between estimator bias and the number of samples, given the threshold $N_{\max}$. As demonstrated in Theorem 4.1, with a given $N_{\max}$ (where the number of samples per iteration is $\mathcal{O}(N_{\max})$), the estimator bias is $\mathcal{O}(N_{\max} \cdot 2^{\frac{N_{\max}}{2}})$, and the variance is bounded by $\mathcal{O}(N_{\max})$.



% The T-MLMC estimator could unbiased estimate the dual function value when $N_{\max}\to \infty$. However, under our work setting, the estimated samples are $\mathcal{O}(N_{\max})$.  When $N_{\max}\to \infty$ (which is the original MLMC estimator),  the estimator is unbiased but the samples are infinite. 

% In general, the sample complexity analysis only requires an estimator with bounded bias instead of a strictly unbiased estimator. Therefore, we analyze the trade-off between the estimator bias and samples given the threshold $N_{\max}$. Shown as Theorem 4.1, given $N_{\max}$ (where the samples each iteration is $\mathcal{O}(N_{\max})$), the estimator bias reach $\mathcal{O}(N_{\max}2^{\frac{N_{\max}}{2}})$ and variance is bounded by $\mathcal{O}(N_{\max})$. 


\textbf{***Questions 2: Below Eq. (12), why is the calculation separated according to even and odd indices***}

There is no requirement for the indices of the sampled data to be of any specific type, as long as the dataset is partitioned into two sub-collections of the same size. We adopt the even-odd separation as in the standard MLMC approach, but it can be directly generalized to other partitions.


% Thanks for your comments. Here we provide the motivation of MLMC algorithm. 
% To estimate a function $f(p)=f(\hat{p}_\infty)$, we have
% $f(\hat p_\infty)= f(\hat p_{2^0})-f(\hat p_{2^0})+f(\hat p_{2^1})-f(\hat p_{2^1})+...=f(\hat p_{2^0})+\sum_{N=0}^\infty f(\hat p_{2^N})- f(\hat p_{2^{N-1}})=f(\hat p_{2^0})+\sum_{N=0}^\infty P_N \frac{f(\hat p_{2^{N+1}})- f(\hat p_{2^{N}})}{P_N}= f(\hat p_{2^0})+\mathbb{E}_{N\sim P_N}[\frac{f(\hat p_{2^{N+1}})- f(\hat p_{2^N})}{P_N}] $. Then, we set $N_1\sim \text{GEO}(g)$, i.e. $\mathbb P(N)=P_N$.  
% The "correction term" $\delta^{r, \rho(\sigma)}_{s,a,N_1}=\frac{f(\hat p_{2^{N_1+1}})- f(\hat p_{2^{N_1}})}{P_{N_1}} $ is the unbiased estimator of $\mathbb{E}_{N\sim P_N}[\frac{f(\hat p_{2^{N+1}})- f(\hat p_{2^N})}{P_N}]$, where the term $f(\hat p_{2^{N_1}})$ is estimated by $ \frac{f(\hat p^O_{2^{N_1}})+f(\hat p^E_{2^{N_1}})}{2}$.

% There is not any requirement of the indices of sampled data. Any other indices which can divide the data into two groups  can be accepted.
To reviewer j69A:

\textbf{***Weaknesses 1: explain the key differences and challenges in the analysis***}
Our algorithm differs from [Wang et al., 2023b] in its threshold design, and the motivations behind the two designs are also different. [Wang et al., 2023b] needs restrictive assumptions to ensure that the MLMC estimator is **unbiased** and **requires finite samples**. In our algorithm, we introduce a threshold to remove the assumption, at the price of a **biased** estimator. This introduces a great challenge in the analysis, since the accumulated error from the bias needs to be analyzed. Our technical contribution lies in the analysis of our biased stochastic approximation framework.

We also want to highlight that we further develop results for two more uncertainty sets that have not been studied previously. It is worth noting that such an extension is not possible using previous approaches. The method in [Wang et al., 2023a, b] relies on the smoothness of the dual form of the DRO problem with the KL-divergence uncertainty set, and an extra assumption is made in [Wang et al., 2023a, b] to ensure this smoothness. However, the dual functions for the TV and $\chi^2$ uncertainty sets are not smooth, so there is no direct extension from the KL case. In contrast, our analysis does not rely on such a smoothness property and can be adapted to all three uncertainty sets.
 

\textbf{***Weaknesses 2: provide empirical study applying our T-MLMC algorithm  in real-life DR-RL problems***}
Our experiments are provided in the appendix due to space limitations.

We further provide an experiment on a real-life problem: the recycling robot problem (Example 3.3 in [1]). A mobile robot running on a rechargeable battery aims to collect empty soda cans. It has 2 battery levels: low and high. The robot can either 1) search for empty cans; 2) remain stationary and wait for someone to bring it a can; or 3) go back to its home base to recharge. Under the low (high) battery level, the robot finds an empty can with probability $\alpha$ ($\beta$) and remains at the same battery level. If the robot goes out to search but finds nothing, it runs out of battery and can only be carried back by a human. We introduce model uncertainty to the probabilities $\alpha,\beta$ of finding an empty can if the robot chooses the action `search'. We implement our algorithm on this problem.

[1] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.



%We set $\delta=0.4$ and implement our algorithms and vanilla Q-learning under the nominal environment ($\alpha=\beta=0.5$) with stepsize $0.01$. To show the difference among the policies that the algorithms learned, we plot the difference of $Q$ values at low battery level in \Cref{Fig.robot1}. In the low battery level, the robust algorithms find conservative policies which choose to wait instead of search, whereas the vanilla Q-learning finds a policy that chooses to search. To test the robustness of the obtained policies, we evaluate the average-reward of the learned policies in perturbed environments. Specifically, let $x$ denote the amplitude of the perturbation. Then, we estimate the worst performance of the two policies over the testing uncertainty set $(0.5-x,0.5+x)$, and plot them in \Cref{Fig.robot2}. It can be seen that when the perturbation is small, the true worst-case kernels (w.r.t. $\delta$ during training) are far from the testing environment, and hence the vanilla Q-learning has a higher reward; however, as the perturbation level becomes larger, the testing environment gets closer to the worst-case kernels, and then our robust algorithms perform better. It can be seen that the performance of Q-learning decreases rapidly while our robust algorithm is stable and outperforms the non-robust Q-learning. This implies that our algorithm is robust to the model uncertainty. 



\zou{provide the new simulation results}
%Thanks for your comments. We will update the simulation result as soon as possible. 

\textbf{***Questions: Whether [Wang et al., 2023b] can be extended to the other two uncertainty sets easily?  If so, is it possible to have lower sample complexities in those settings? If not, can the authors explain the challenge and why the proposed algorithm can work in all settings?***}

The analysis method in [Wang et al., 2023b] relies on the smoothness of the dual form of the DRO problem with the KL-divergence uncertainty set.
Therefore, an extra assumption is made in [Wang et al., 2023b] to ensure this smoothness (eq. (10)). However, the dual functions for the TV and $\chi^2$ uncertainty sets are not smooth, implying that a direct extension of [Wang et al., 2023b] is not feasible.
In contrast, our analysis does not rely on such a smoothness assumption and can be applied to all three uncertainty sets.


The lower complexity in [Wang et al., 2023b] is due to the variance reduction technique. We expect that such a technique can also be applied in our algorithm design, but it requires a more detailed analysis and is left for future work.

%applies the variance reduction method to decrease the sample complexity of MLMC methods. This variance reduction method can be extended to our T-MLMC algorithm. However, this variance reduction is the future work based on this paper instead of the key point. 


To reviewer 6UQA:

\textbf{***Contribution***}
The design of the threshold in MLMC is novel, as it fundamentally changes the original motivation behind MLMC. Specifically, MLMC is designed to construct an **unbiased** estimator, at the price of requiring an infinite number of samples or restrictive assumptions. To bypass these issues, we introduce a threshold in MLMC that lets us control the number of samples, resulting in a biased estimator. We further develop a novel analysis of the biased stochastic approximation framework, showing that, despite the bias, our algorithm still converges to the optimal policy with a reduced sample complexity. The motivation of our design is hence different from that of MLMC, and it enables us to obtain stronger results.

We also want to highlight that we further develop results for two more uncertainty sets that have not been studied previously. It is worth noting that such an extension is not possible using previous approaches. The method in [Wang et al., 2023b] relies on the smoothness of the dual form of the DRO problem with the KL-divergence uncertainty set, which does not hold for the TV and $\chi^2$ uncertainty sets.
In contrast, our analysis does not rely on it and can be adapted to all three uncertainty sets.

%We first highlight that we additionally provide results for two more uncertainty sets: total variation and Chi-square, compared to [Wang et al., 2023 ab]. 

%And when compared with previous works on KL uncertainty set, besides the sample complexity, the major improvement is that we get rid of the restrictive assumption they made. It is assumed that
%$$\frac{1}{2}p_\wedge \geq 1- e^{-\sigma},$$ where the $p_\wedge$ is the  minimum positive entry of the nominal transition kernel. This condition restricts the capability of their approaches, whereas ours does not require it. 

%Therefore, we provide a new approach to analyze the sample complexity, which is different from the work in [Wang et al., 2023 ab]. 

\textbf{***Comments 1: results in [Shi et al., 2023] depend on the uncertainty level***}

Thanks for your comments. Since we mainly want to highlight the dependence on other parameters, e.g., $1-\gamma$, $p_\wedge$, $|\mathcal{S}|$ and $|\mathcal{A}|$, we omitted the uncertainty level in the tables. We will add it in the revised version of the paper.

%The result for TV uncertainty set in [Shi et at., 2023] should be $\frac{SA}{(1-\gamma)^3 \epsilon^2}$ when $\sigma \leq 1-\gamma$ and $\frac{SA}{(1-\gamma)^2 \sigma \epsilon^2}$ when $1\geq\sigma \geq 1-\gamma$. For $\chi^2$ uncertainty set, the result is  $\frac{SA(1+\sigma)}{(1-\gamma)^4 \epsilon^2}$. We will update the Table 1-3 in revised version including the dependence of uncertainty level $\sigma$. 

\textbf{***Comments 2: In the contribution part, the claim that your result on KL “significantly enhancing previous findings for the KL divergence model” is not clear until the end of section 4***}
We will add a discussion of the results to the contribution part in the revision. Specifically, the improvements of our results are twofold: we get rid of any additional assumptions, and we obtain a better complexity than the vanilla algorithm in [Wang et al., 2023b].

The result in [Wang et al., 2023b] relies on variance reduction, a standard technique that reduces the variance and thus the sample complexity. We also expect that our complexity can be further reduced if variance reduction is used. We leave this for future work.

\textbf{***Comments 3: include some function approximation results in DR-RL***}

Thanks for your comments. Since we mainly focus on the tabular setting, we omitted the large body of research on function approximation. We will include a discussion of these works. [Tamar et al., 2014] first extended robust MDPs to linear function approximation and provided asymptotic convergence guarantees. Subsequently, numerous studies have explored robust MDPs with linear function approximation, including [Badrinath et al., 2021], [Ma et al., 2022], [Blanchet et al., 2023], [Zhou et al., 2023], [Panaganti et al., 2022] and [Liu \& Xu, 2024]. These studies have proposed related algorithms and theoretical results.

\zou{add the discussion of function approximation results here.}

\textbf{***Comments 4: the results in Theorem 4.1 omit the range $r_{\max}$ and $\frac{1}{1-\gamma}$***}

Thanks for your comments. The explicit dependence on $r_{\max}$ and $\frac{1}{1-\gamma}$ can be found in eqs. (44) and (55) for TV, eqs. (68) and (79) for $\chi^2$, and eqs. (98) and (112) for KL. We will revise the paper accordingly.

%, we provide the detailed expression of the estimator bias and variance bound. %However, Theorem 4.1 can be applied to TV, $\chi^2$ and KL as well, which has a different order of these terms, e.g. $\sigma$ and $p_\wedge$. Thus, we only provide the depends of $N_{\max}$ and omit the other terms. 

\textbf{***Comments 5: The $P_{N_1}$ is not defined.***}

Thanks for your comments. $P_{N_1}$ denotes the probability of $N=N_1$, where $N$ is the random level number drawn from the geometric distribution.

\textbf{***Comments 6: How to estimate the empirical reward distribution $\hat \mu$?For example, using some non-parametric estimators?***}
 % In our work, we assume that the reward distribution is discrete. Therefore, we estimate the robust worst-case reward value with the same method with robust worst-case value function.  \

In our work, we assume that the reward distribution is discrete. Therefore, we estimate $\hat{\mu}$ using the empirical distribution, i.e., $\hat{\mu}(r)=\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{R_i=r\}$, where $R_1,\dots,R_N$ are the sampled rewards. In practice, when the reward is continuous, we can use $k$-nearest-neighbor or kernel density estimation to estimate the reward distribution.


\textbf{***Comments 7: Typo: the sample complexity in Theorem 4.2 should be "greater than or equal to".***}

Thanks for your comments. We will correct this typo in the revised version.

\textbf{***Comments 8:The argument is not accurate for  [Li et al., 2020, 2021]:***}{\color{red}whether reach $(1-\gamma)^{-3}$: need check}

 We will modify our discussion accordingly. Our results match the state of the art among model-based approaches for the $\chi^2$ and KL uncertainty sets in terms of $1-\gamma$, but present a gap relative to the model-based results in the TV case. Comparing our analysis with [Shi et al., 2022], we prove that applying a tighter variance analysis improves our TV result to $\mathcal{O}((1-\gamma)^{-4})$. We will update the results in the revised version. Additionally, applying variance reduction to our T-MLMC algorithm could achieve the same complexity as [Shi et al., 2022], which we aim to explore in the future.

% Thanks for the comment and we apologize for the incorrect description. We will modify our discussion accordingly. Namely, our results match the state-of-the-art in model-based approaches for $\chi^2$ and KL uncertainty sets in terms of $1-\gamma$, and presents a gap between the model-based results under TV case. It is of our future interest to investigate whether robust RL should enjoy a similar complexity with non-robust RL and whether our results can be further improved. 

% Thanks for the comment and we apologize for the incorrect description. We will modify our discussion accordingly. Namely, our results match the state-of-the-art in model-based approaches for $\chi^2$ and KL uncertainty sets in terms of $1-\gamma$, and presents a gap between the model-based results under TV case. Compared the analysis in our work with the analysis in [Shi et al, 2022], we notice that our reault for TV could reach $\mathcal{O}((1-\gamma)^{-4})$ when applying a tighter analysis of estimator variance. We will update the results in revised version.  Moreover, when applying the variance reduction method to our T-MLMC algorithm,  it is further expected to reach the same order of complexity as in [Shi et al., 2022]. We will update the paper and add those results in the revision.


%We would like to describe that  $(1-\gamma)^{-5}$ order is tight (without variance reduction) since the best results for model-based robust Q-learning specifically for $\chi^2$ and KL are $\mathcal{O}((1-\gamma)^{-4})$ shown in Table 2,3. For model-free vanilla Q-learning, [Li et al., 2021] reaches $\mathcal{O}((1-\gamma)^{-4})$ sample order without variance reduction and reach $\mathcal{O}((1-\gamma)^{-3})$ order with variance reduction. 


\textbf{***Comments 9: Provide the literature for the argument “Compared to the model-based methods, our complexity presents an additional $\mathcal{O}((1-\gamma)^{-1})$- order dependence, which is common in model-free algorithms.” ***}
We apologize for the misleading statement here. We will correct it in the revision.  

Take the standard non-robust RL problem as an example. The lower bound (optimal complexity) for model-based RL is $\mathcal{O}((1-\gamma)^{-3})$ [1], whereas vanilla model-free Q-learning (without variance reduction) requires $\mathcal{O}((1-\gamma)^{-4})$ [Li et al., 2021], and variance-reduced Q-learning achieves the minimax lower bound. In this sense, the vanilla model-free algorithm incurs an additional $\mathcal{O}((1-\gamma)^{-1})$ factor, and we expect the same in the robust setting.

[1] Li, et al. Breaking the sample size barrier in model-based reinforcement learning with a generative model. (2020)

% [2] Li, Gen, et al. "Is Q-learning minimax optimal? a tight sample complexity analysis." Operations Research 72.1 (2024): 222-236.

% \zou{add the paper of variance reduced Q is minimax optimal}
%For vanilla $Q$-learning,  [Li et al., 2021] reaches $\mathcal{O}((1-\gamma)^{-4})$ sample order without variance reduction and reach $\mathcal{O}((1-\gamma)^{-3})$ order with variance reduction. 

\textbf{***Comments 10: Why “This assumption significantly limits the applicability of their results”?***}
[Wang et al., 2023a,b] require the assumption
$\frac{1}{2}p_\wedge \geq 1- e^{-\sigma}$. Because $1-e^{-\sigma}\geq \frac{\sigma}{2}$ for every $\sigma$ that can satisfy this assumption, it implies $\frac{1}{2}p_\wedge \geq 1- e^{-\sigma}\geq \frac{\sigma}{2}$, i.e., $\sigma\leq p_\wedge$, so the radius of the uncertainty set has to be very small if $p_\wedge$ is small.

Moreover, since there is no information about $p_\wedge$ in practice, it can be challenging to design an uncertainty set satisfying the assumption. 
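
To make the restriction concrete, the following short derivation may help; the numerical values are illustrative assumptions and not parameters from the paper. Since $p_\wedge\leq 1$ is a probability, the assumption $\frac{1}{2}p_\wedge \geq 1-e^{-\sigma}$ forces $\sigma\leq\ln 2<1$, and on $[0,1]$ the concavity of $1-e^{-\sigma}$ gives $1-e^{-\sigma}\geq(1-e^{-1})\sigma\geq\frac{\sigma}{2}$. Hence
\[
\sigma \;\leq\; 2\left(1-e^{-\sigma}\right) \;\leq\; p_\wedge ,
\qquad\text{and, exactly,}\qquad
\sigma \;\leq\; -\ln\!\left(1-\tfrac{1}{2}p_\wedge\right).
\]
For example, with $p_\wedge=0.05$ the largest admissible radius is $-\ln(0.975)\approx 0.0253$, so a choice such as $\sigma=0.1$ violates the assumption because $1-e^{-0.1}\approx 0.095>0.025$.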




%We can easily come up with an uncertainty set such that this assumption is violated.  For example, let $p_\wedge=0.05$ and $\sigma=0.1$, then this assumption is not satisfied.





% [2] Panaganti, Kishan, et al. "Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage." arXiv preprint arXiv:2310.18434 (2023). 

% This condition restricts the capability of their approaches, whereas ours does not require it. 
**Contribution**

The threshold design in MLMC is novel, as it fundamentally changes the original motivation behind MLMC. Specifically, MLMC is designed to construct an **unbiased** estimator, at the price of requiring an infinite number of samples or restrictive assumptions. To bypass these issues, we introduce a threshold in MLMC that lets us control the number of samples, resulting in a biased estimator. We further develop a novel analysis of the biased stochastic approximation framework, showing that, despite the bias, our algorithm still converges to the optimal policy with reduced sample complexity. The design motivation hence differs from that of MLMC and enables us to obtain stronger results.

We also highlight that we study two additional uncertainty sets that were not previously studied. Such an extension is not possible with previous approaches: the method in [Wang et al., 2023b] relies on the smoothness of the dual form of the DRO problem with a KL-divergence uncertainty set, which does not hold for the TV and $\chi^2$ uncertainty sets. In contrast, our analysis does not rely on this property and can be adapted to all of these uncertainty sets.

**Comments 1:** Since we mainly want to highlight the dependence on the other parameters, e.g., $(1-\gamma, p_\wedge, S, A)$, we omitted the uncertainty level from the table. We will add it in the revised version.

**Comments 2:** Our comparison is mainly with vanilla approaches. Our results improve on them in two ways: we remove all additional assumptions, and we achieve better complexity than the vanilla algorithm. Their better result relies on variance reduction, a standard technique for reducing sample complexity that can be viewed as a modification of vanilla algorithms, so we do not focus on it. We will make this clearer in the paper.

**Comments 3:** Because we focus on the tabular setting, we omitted this discussion. We will add a discussion of these works. [Tamar et al., 2014] first extended robust MDPs to linear function approximation and established asymptotic convergence. Subsequently, numerous studies have explored robust MDPs with linear function approximation, including [Panaganti et al., 2022], [Badrinath et al., 2021], [Ma et al., 2022], [Blanchet et al., 2023], [Zhou et al., 2023], and [Liu \& Xu, 2024].

**Comments 4:** The explicit dependence on $r_{\max}$ and $\frac{1}{1-\gamma}$ can be found in eqs. (44) and (55) for TV, eqs. (68) and (79) for $\chi^2$, and eqs. (98) and (112) for KL. We will revise the paper accordingly.

**Comments 5:** $P_{N_1}$ denotes the probability that $N=N_1$, where $N$ is the random level number drawn from the geometric distribution.

**Comments 6:** In our work, we assume the reward distribution is discrete. Therefore, we estimate $\hat{\mu}$ using the empirical distribution, i.e., $\hat{\mu}(r)=\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{r_i=r\}$, where $r_1,\dots,r_N$ are the $N$ sampled rewards. In practice, when the reward is continuous, we can use $k$-nearest-neighbor or kernel density estimation to estimate the reward distribution.

**Comments 7:** We will correct this typo in the revised version.

**Comments 8:** We will correct our discussion accordingly. Our results match the state of the art among model-based approaches for the $\chi^2$ and KL uncertainty sets in terms of $1-\gamma$, but present a gap relative to the model-based results in the TV case. Based on [Shi et al., 2022], we expect that a tighter variance analysis will improve our TV result to $\mathcal{O}((1-\gamma)^{-4})$. We will update the results. Additionally, applying variance reduction to our T-MLMC algorithm could achieve the same complexity as [Shi et al., 2022], which we aim to explore in the future.

**Comments 9:** We apologize for the misleading statement here. For standard non-robust RL, the lower bound (optimal complexity) for model-based RL is $\mathcal{O}((1-\gamma)^{-3})$ [1], whereas vanilla model-free Q-learning (without variance reduction) requires $\mathcal{O}((1-\gamma)^{-4})$ [Li et al., 2021], and variance-reduced Q-learning achieves the minimax lower bound. In this sense, the vanilla model-free algorithm incurs an additional $\mathcal{O}((1-\gamma)^{-1})$ factor, and we expect the same in the robust setting.

[1] Li et al. Breaking the sample size barrier in model-based reinforcement learning with a generative model.

**Comments 10:** [Wang et al., 2023a,b] require the assumption
$\frac{1}{2}p_\wedge \geq 1- e^{-\sigma}$. Because $1-e^{-\sigma}\geq \frac{\sigma}{2}$ for every $\sigma$ that can satisfy this assumption, it implies $\frac{1}{2}p_\wedge \geq 1- e^{-\sigma}\geq \frac{\sigma}{2}$, i.e., $\sigma\leq p_\wedge$, so the radius of the uncertainty set has to be very small if $p_\wedge$ is small.

Moreover, since there is no information about $p_\wedge$ in practice, it can be challenging to design an uncertainty set satisfying the assumption. 

To reviewer dDey:

\textbf{***Weaknesses 1: Lack of baseline comparison in experiments.***}
Thanks for your comments. We currently use the optimal robust value function as the baseline, and we will include comparisons with other robust RL algorithms.

%We will add the existing two  update the baseline result as soon as possible. 

\textbf{***Weaknesses 2: Performance of T-MLMC algorithm in real-world scale problems.***} 
Since we mainly consider the tabular setting, we only conduct simple experiments. Extending our T-MLMC to large-scale problems, although doable, can incur a large computational cost. We will add experiments on large-scale problems and explore potential theoretical extensions.
\zou{add experiments}

\textbf{***Comments 1: more extensive comparison with baseline models, especially in demonstrating how the proposed methods converge on the same problems, to highlight its advantages more clearly ***}

Thanks for your comments. We will update the baseline result as soon as possible. 


\textbf{***Comments 2: provide specific examples or case studies where your proposed algorithm has been or could be successfully applied to solve real-world-level tasks? How does the algorithm address the challenges specific to these applications?***}

Thanks for your comments. We currently apply our T-MLMC algorithm to learn a policy for the recycling robot problem (Example 3.3 [1]). A mobile robot running on a rechargeable battery aims to collect empty soda cans. It has two battery levels: low and high. The robot can either 1) search for empty cans; 2) remain stationary and wait for someone to bring it a can; or 3) go back to its home base to recharge. Under the low (high) battery level, the robot finds an empty can with probability $\alpha$ ($\beta$) and remains at the same battery level. If the robot goes out to search but finds nothing, it runs out of battery and can only be carried back by a human. We introduce model uncertainty in the probabilities $\alpha,\beta$ of finding an empty can when the robot chooses the action `search', and we run our algorithm on this problem.

The challenge here is that the collected samples can be limited, which may result in inaccurate estimation and sub-optimal performance. Our T-MLMC algorithm ensures that, with a limited number of samples, we still obtain the optimal policy for the problem. Compared to previous methods that need strong assumptions or infinitely many samples, our algorithm can be implemented in practice and is more broadly applicable.

Besides, we also plan to extend our T-MLMC algorithm to solve the stability problems in wireless connections.

We will update the results when finished.

[1] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

\zou{we need to provide simulation results now! during this rebuttal window!!!!}

\textbf{***Comments 3: discuss further how this bias impacts the learned policy's overall performance and reliability and compare it to the bias present in other methods ***}

The bias in the operator estimate can be made small by choosing $N_{\max}=\log(1/\epsilon)$, and its effect on the performance of the learned policy can be bounded. Namely, our algorithm is shown to converge to the **optimal** robust policy, and hence attains the best performance. Moreover, the sample complexity (convergence rate) of our algorithm is better, meaning that with the same amount of data, our algorithm can obtain a better policy.
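
A sketch of why the bias is governed by $N_{\max}$, under simplifying assumptions that may differ from the exact construction in the paper: suppose the level $N$ is drawn from a distribution $P_N$ with $P_N>0$ on $\{0,\dots,N_{\max}\}$, that $\hat{p}_{n}$ is the empirical estimate from $n$ samples, and that the estimator of $f(p)$ is $f(\hat{p}_{2^0})+\frac{f(\hat{p}_{2^{N+1}})-f(\hat{p}_{2^{N}})}{P_N}$. Taking the expectation over both $N$ and the samples, the sum telescopes:
\[
\mathbb{E}\big[f(\hat{p}_{2^0})\big]+\sum_{N=0}^{N_{\max}} P_N\,\frac{\mathbb{E}\big[f(\hat{p}_{2^{N+1}})\big]-\mathbb{E}\big[f(\hat{p}_{2^{N}})\big]}{P_N}
=\mathbb{E}\big[f(\hat{p}_{2^{N_{\max}+1}})\big].
\]
In expectation, the truncated estimator therefore matches the plug-in estimator built from $2^{N_{\max}+1}$ samples, and its bias is $\mathbb{E}[f(\hat{p}_{2^{N_{\max}+1}})]-f(p)$, which shrinks as $N_{\max}$ grows provided the plug-in estimator is consistent.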

The trade-off is between bias and sample complexity or assumptions. To keep the estimator unbiased, either infinitely many samples [Liu et al., 2022] or restrictive assumptions [Wang et al., 2023a,b] are required; if we admit bias, we need neither. Both approaches nonetheless have similar performance, as they both converge to the optimal policy.
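
As an illustration of the sample-count side of this trade-off, assume (hypothetically; the level distribution used in the paper may have a different parameter) that level $N$ consumes $2^{N+1}$ samples and is drawn with probability $P_N=2^{-(N+1)}$ for $N\geq 0$. The expected number of samples without truncation diverges, whereas truncating at $N_{\max}$ and renormalizing keeps it finite:
\[
\sum_{N=0}^{\infty} 2^{-(N+1)}\,2^{N+1}=\infty,
\qquad
\sum_{N=0}^{N_{\max}} \frac{2^{-(N+1)}}{1-2^{-(N_{\max}+1)}}\,2^{N+1}=\frac{N_{\max}+1}{1-2^{-(N_{\max}+1)}}.
\]
Under this assumption, the expected per-step cost grows only linearly in $N_{\max}$, compared with the $2^{N_{\max}+1}$ samples of a plug-in estimate at the highest level.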


%Introducing the threshold also allows us to get rid of restrictive assumption in  [Wang et. al., 2023 ab], introduced to ensure unbiased MLMC
%However, this fact does not mean that the convergence of $Q$-table in T-MLMC is unstable. The $Q$ table will converge to a surrogate sub-optimal robust policy which is in $\epsilon_{\text{bias}}$ neighborhood by the optimal robust policy, when the $Q$ table is updated according to the biased T-MLMC estimator. 

%The model-free MLMC algorithms in [Wang et. al., 2023 ab] will not lead to the bias. However, our T-MLMC algorithm will also converge to the $\epsilon$-accurate robust optimal policy since the bias $\epsilon_{\text{bias}}$ can be bounded when applying suitable/large enough threshold, $N_{\max}$. Therefore, in theory, [Wang et. al., 2023 ab] and our work will have the same performance. 

%Moreover, there is an extra assumption in [Wang et. al., 2023 ab] for the convergence guarantee, which limits the uncertainty level by the minimum non-zeros support $p_\wedge$. Besides, the work in [Wang et. al., 2023 ab] is specifically for KL uncertainty set, which limits the reliability.

%The model-based algorithms shown in Table 1-3 will lead to the bias as well. Therefore, in theory, these works and our work will have the same performance. The model-based algorithms will reach a better sample complexity. However, the computational and memory costs are expensive for model-based algorithms since $\mathcal{O}(\epsilon^{-2]})$ samples are required per step.


\end{document}