\section{Introduction}\label{sec:introduction}

Reinforcement learning (RL) has recently emerged as a compelling paradigm for modeling machine learning applications with sequential decision making. In such a problem, an online learner interacts with the environment sequentially over Markov decision processes (MDPs), and aims to find a desirable policy that achieves a low accumulated loss (or a high accumulated reward). Various algorithms have been developed for RL problems and have been shown theoretically to achieve polynomial sample efficiency, e.g., in~\cite{zimin2013online,azar2017minimax,jin2018q,agarwal2019reinforcement,bai2019provably,jin2020learning,jin2020provably,cai2020provably,gao2021provably,lykouris2021corruption,qiao2022sample}.

In addition to the metric of losses, {\bf switching costs}, which capture the costs for changing policies during the execution of RL algorithms, are also attracting increasing attention. This is motivated by many practical scenarios where the online learners cannot change their policies for free. For example, in recommendation systems, each change of the recommendation involves the processing of a huge amount of data and additional computational costs~\cite{theocharous2015personalized}. Similarly, in healthcare, each change of the medical treatment requires substantial human effort and time-consuming tests and trials~\cite{yu2021reinforcement}. Such switching costs must also be considered in many other areas, e.g., robotics applications~\cite{kober2013reinforcement}, education software~\cite{bennane2013adaptive}, computer networking~\cite{xu2018experience}, and database optimization~\cite{krishnan2018learning}.

Switching costs have been studied in various problems
(please see Sec.~\ref{sec:introrelatedwork} for some examples). Among these studies, a relevant line of research is along bandit learning~\cite{geulen2010regret,dekel2014bandits,arora2019bandits,shi2022power}. More recently, switching costs have received considerable attention in more general RL settings~\cite{bai2019provably,gao2021provably,wang2021provably,qiao2022sample}. However, these studies have mainly focused on \emph{static} RL, where the loss distribution is assumed to be fixed during the learning process. Thus, practical scenarios where the loss distribution could be non-stationary or even adversarial are not characterized or considered. 

While \textbf{adversarial RL} better models the non-stationary or adversarial changes of the loss distribution, to the best of our knowledge, an open problem remains: \emph{how to develop a provably efficient algorithm for adversarial RL with switching costs?} Intuitively, in adversarial RL, since much more frequent policy switches would be needed to adapt to the time-varying environment, it would be much more difficult to achieve a low regret (including both the standard loss regret and the switching costs; please see (\ref{eq:defineregret})). Indeed, without a special design to reduce switching, existing algorithms for adversarial RL with $T$ episodes, such as those in~\cite{zimin2013online,jin2020learning,lee2020bias} and~\cite{lykouris2021corruption}, could perform poorly, incurring a number of policy switches that is linear in $T$. \emph{Thus, the goal of this paper is to make the first effort along this open direction.}

Our first aim is to develop provably efficient algorithms that enjoy low regrets in adversarial RL with switching costs. This requires a careful reduction of switching under non-stationary or adversarial loss distributions. It turns out that previous approaches to reduce switching in \emph{static} RL (e.g., those in~\cite{bai2019provably} and~\cite{qiao2022sample}) are not applicable here. Specifically, the high-level idea in static RL is to switch faster at the beginning, and then to switch more and more slowly in later episodes. Such a method performs well in static RL, mainly because after learning enough information about losses at the beginning (by switching faster), the learner can estimate the assumed \emph{fixed} loss distribution accurately enough with high probability in later episodes. Thus, even though the learner switches more and more slowly, a low regret is still achievable with high probability. In contrast, when the loss distribution could change arbitrarily, this method does not work. This is mainly because what the learner learned in the past may not be that useful for the future. For example, when the loss distribution is adversarial, a state-action pair with small losses in the past may incur large losses in the future. Thus, new ideas are required for addressing switching costs in adversarial RL.

Our second aim is to understand fundamentally whether the new challenge of switching costs in adversarial RL significantly increases the regret. This requires a converse result, i.e., a lower bound on the regret, that holds for any RL algorithm. Further, we aim to understand fundamentally whether the adversarial nature of RL indeed requires many more policy switches to achieve a low loss regret.








\subsection{Our Contributions}\label{subsec:introcontribution}

In this paper, we achieve the aforementioned goals and make the following three main contributions. (We use $\tilde{\Omega}$, $\tilde{\Theta}$ and $\tilde{O}$ to hide constants and logarithmic terms.)

\textbf{First}, we provide a lower bound (in Theorem~\ref{theorem:lowerbound}) that shows that, for adversarial RL with switching costs, the regret of any algorithm must be at least $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs is no longer achievable. Further, we characterize precisely the new trade-off (in Theorem~\ref{theorem:lossswitchingtradeoff}) between the standard loss regret and the switching costs due to the adversarial nature of RL.

\textbf{Second}, we develop the first-known near-optimal algorithms for adversarial RL with switching costs. As we discussed above, the idea for reducing switching in static RL does not work well here. To handle the losses that can change arbitrarily, our design is inspired by the approach in~\cite{shi2022power} for bandit learning, but with two novel ideas. (a) We delay each switch by a fixed (but {\em tunable}) number of episodes, which ensures that a switch occurs only once every $\tilde{O}(T^{1/3})$ episodes. (b) The idea in (a) results in consistently long intervals of not switching. Since the bias in estimating losses from such a long interval tends to increase the regret, it is important to construct an unbiased estimate of losses for each interval. To achieve this, the idea in bandit learning is to consider all time-slots in each interval as one time-slot, which necessarily requires a \emph{single} chosen action in each interval. Such an approach is not applicable to our more general MDP setting, since, due to state transitions, there is no guarantee that a \emph{single} state-action pair is visited. To resolve this issue, our novel idea is to decompose each interval, and then combine the losses of each state-action pair only from the episodes in which such a state-action pair is visited. Interestingly, although this combination is random and the loss is adversarial, the expectation of the estimated losses is (almost) unbiased.

\textbf{Third}, we establish the regret bounds for our new algorithms. For the case with a \textit{known} transition function, we show that our algorithm achieves an $\tilde{O}( (HSA)^{1/3} T^{2/3} )$ regret, which matches our lower bound. For the case with an \textit{unknown} transition function, we show that, with probability $1-\delta$, our algorithm achieves an $\tilde{O}\left( H^{2/3} (SA)^{1/3} T^{2/3} ( \ln\frac{TSA}{\delta})^{1/2} \right)$ regret, which matches our lower bound in its dependency on $T$, $S$ and $A$, up to a small factor of $\tilde{O}(H^{1/3})$. Therefore, the regrets of our new algorithms are near-optimal. Moreover, because our novel ideas for estimating losses and delaying switching, discussed above, operate in a setting with state transitions, our proofs for the regrets involve several new analytical ideas. For example, in Lemma~\ref{lemma:knownexpectedlossunbias} and Lemma~\ref{lemma:unknownexpectedlossunbias}, we show that our new way of estimating losses is (almost) unbiased so that its effect on the regret is controllable. Moreover, to capture the effect of the delayed switching, our new analytical idea is to first bound the regret across intervals between adjacent switching events, and then relate the regret inside episodes of each interval to this bound (please see \textit{Step-2} of the proofs in Appendix~\ref{appendix:proofofthmknowntranprobregret} and Appendix~\ref{appendix:proofofthmunknowntranprobregret}).










\section{Related Work}\label{sec:introrelatedwork}

\textbf{Switching costs:} Switching costs have already received considerable attention in various online problems. For example, online convex optimization with switching costs has been studied in~\cite{lin2012online,bansal20152,chen2016using,goel2019beyond,shi2021competitive,shi2021combining}, etc. Convex body chasing with switching costs has been studied in~\cite{friedman1993convex,sellke2020chasing,bubeck2021online}, etc. Switching costs have also been studied in metrical task systems~\cite{borodin2005online}, online set covering~\cite{buchbinder2014competitive}, the $k$-server problem~\cite{lin2020online}, online control~\cite{goel2019online,li2020online,lin2021perturbation}, etc. Moreover, switching costs have been studied in adversarial bandit learning, e.g., in~\cite{geulen2010regret,dekel2014bandits,arora2019bandits,shi2022power}. Our work in this paper can be viewed as a non-trivial generalization of these studies on bandit learning to adversarial MDPs, where state transitions and multiple layers in each episode require new developments in both the algorithm design and regret analysis.

\textbf{Static MDP:} There have been recent studies on static RL with switching costs. Specifically, for tabular MDP, \cite{bai2019provably} and~\cite{zhang2020almost} proposed RL algorithms that attain an $\tilde{O}\left( \sqrt{H^{\alpha}SAT} \cdot\ln\frac{TSA}{\delta} \right)$ regret with probability $1-\delta$, by incurring $O\left(H^{\alpha}SA \ln T\right)$ switching costs, where $\alpha=3$ and $2$, respectively. Recently,~\cite{qiao2022sample} obtained a similar $\tilde{O}(\sqrt{T})$ regret with probability $1-\delta$, by incurring $O\left( HSA \ln\ln T \right)$ switching costs. Moreover, for linear MDP (with $d$-dimensional feature space),~\cite{gao2021provably} and \cite{wang2021provably} obtained an $\tilde{O}\left( \sqrt{ d^3 H^3 T } \cdot (\ln \frac{dT}{\delta})^{1/2} \right)$ regret with probability $1-\delta$, by incurring $O\left( dH\ln T \right)$ switching costs.

\textbf{Adversarial MDPs:} Adversarial RL better models scenarios where the loss distributions and/or the transition functions of MDPs could change over time. Specifically, in tabular MDP with a known transition function,~\cite{zimin2013online} proposed an RL algorithm that attains an $\tilde{O}(\sqrt{HSAT})$ regret. In the case with an unknown transition function,~\cite{jin2020learning} and~\cite{lee2020bias} obtained an $\tilde{O}\left(HS\sqrt{AT \ln \frac{TSA}{\delta}}\right)$ regret with probability $1-\delta$. These studies assume that the state spaces of layers in an episode are non-overlapping. Moreover,~\cite{rosenberg2019online} studied the case with full-information feedback. Adversarial linear MDP has also been studied recently, e.g., in~\cite{cai2020provably,luo2021policy}. In addition,~\cite{yu2009arbitrarily,cheung2019reinforcement} and~\cite{lykouris2021corruption} studied the case when both the loss distribution and transition function change arbitrarily. More studies on various adversarial RL settings have been done by~\cite{rosenberg2019onlinenips,lee2021achieving,zhao2021linear,jin2021best,he2022nearly}, etc.

To the best of our knowledge, no study in the literature has addressed the challenge due to \emph{switching costs in adversarial RL}, which is the focus of this paper.









\section{Problem Formulation}\label{sec:problemformulation}

We consider adversarial reinforcement learning (RL) with switching costs in episodic Markov decision processes (MDPs). Suppose there are $T$ episodes, each of which consists of $H$ layers. We use $\mathcal{S}_{h}$ to denote the state space of layer $h$. For ease of elaboration, as in previous work (e.g.,~\cite{zimin2013online,jin2020learning} and~\cite{lee2020bias}), we assume that the $H$ layers are non-intersecting, i.e., $\mathcal{S}_{h'} \cap \mathcal{S}_{h''} = \emptyset$ for any $h'\neq h''$; $\mathcal{S}_{0} = \{s_{0}\}$ is a singleton; and each episode ends at the terminal state $s_{H}$, i.e., $\mathcal{S}_{H} = \{s_{H}\}$. Thus, the entire state space is $\mathcal{S} = \cup_{h=0}^{H} \mathcal{S}_{h}$ with size $S = \sum_{h=0}^{H} S_{h}$, where $S_{h}$ denotes the size of $\mathcal{S}_{h}$. Moreover, we use $\mathcal{A}$ to denote the action space with size $A$. Then, the MDP is defined by a tuple $\left( \mathcal{S}, \mathcal{A}, P, \left\{l_{t}\right\}_{t=1}^{T}, H \right)$, where $P$ is the transition function with $P_{h}:~\mathcal{S}_{h+1} \times \mathcal{S}_{h} \times \mathcal{A} \rightarrow [0,1]$ denoting the transition probability measure at layer $h$, and $l_{t}:~\mathcal{S} \times \mathcal{A} \rightarrow [0,1]$ represents the loss function for episode $t$.

The online learner interacts with the Markov environment episode-by-episode as follows. At the beginning of each episode $t=1,\ldots,T$, the online learner starts from state $s_{0}$ and follows an algorithm that (possibly randomly) chooses a \emph{deterministic} policy $\pi_{t}: \mathcal{S} \rightarrow \mathcal{A}$. Next, at each layer $h=0,\ldots,H-1$, after observing the current state $s_{t,h}$, the learner chooses an action $a_{t,h} = \pi_{t}(s_{t,h})$. Then, the learner incurs a loss $l_{t}(s_{t,h},a_{t,h})$. Finally, the next state $s_{t,h+1} \in \mathcal{S}_{h+1}$ is drawn according to the transition probability $P(\cdot|s_{t,h},a_{t,h})$. (For simplicity, we drop the index $h$ of $P_{h}$ in this paper when it is clear from the context.) These steps repeat until the learner arrives at the last state $s_{H}$. At the end of episode~$t$, only the losses of visited state-action pairs in the episode are observed by the learner, whereas the losses of non-visited state-action pairs are unknown. As in~\cite{zimin2013online,jin2020learning,lee2020bias,cai2020provably}, this is called ``\textbf{bandit feedback}'', which is more practical than full-information feedback~\cite{rosenberg2019online} that assumes the losses of all state-action pairs (whether visited or not) are known for free.

\textbf{Adversarial losses:} Different from static RL that assumes the loss distribution is fixed for all episodes, in the adversarial setting we consider here, we do not need any assumption on the underlying loss distribution. That is, the loss function $l_{t}$ could change arbitrarily across episodes.

\textbf{Switching costs:} As we mentioned in the introduction, in adversarial RL, addressing switching costs remains an open problem. The switching cost refers to the cost needed for changing the policy~$\pi_{t}$. It is equal to $\beta \cdot \mathbf{1}_{\{\pi_{t+1}\neq \pi_{t}\}}$, where $\beta$ is the switching-cost coefficient ($\beta$ is strictly positive and is independent of $T$) and $\mathbf{1}_{\mathcal{E}}$ is an indicator function (i.e., $\mathbf{1}_{\mathcal{E}} = 1$ if the event $\mathcal{E}$ occurs, and $\mathbf{1}_{\mathcal{E}} = 0$ otherwise).

Therefore, the total cost of executing an RL algorithm $\pi$ over $T$ episodes is given by
\begin{align}
\text{Cost}^{\pi}(1:T) \triangleq \mathbb{E} \left[ \sum\limits_{t=1}^{T} \sum\limits_{h=0}^{H-1} l_{t}(s_{t,h}^{\pi},a_{t,h}^{\pi}) + \sum\limits_{t=1}^{T-1} \beta \cdot \mathbf{1}_{\{\pi_{t+1} \neq \pi_{t}\}} \Big | \pi,P \right], \label{eq:definetotalonlinecost}
\end{align}
where the expectation is taken with respect to the randomness of the state-action pairs $(s_{t,h}^{\pi},a_{t,h}^{\pi})$ visited by $\pi$, and the possible randomness of changing the policy $\pi_{t}$.










Next, we introduce a concept called ``occupancy measure''~\cite{zimin2013online,jin2020learning}. Specifically, the occupancy measure $q_{t}^{\pi,P}(s,a) = Pr[s_{t,h}^{\pi}=s,a_{t,h}^{\pi}=a | \pi,P] \geq 0$ is the probability of visiting the state-action pair~$(s,a)$ by the algorithm $\pi$ at layer $h$ of episode $t$ under the transition function $P$. In addition (with slight abuse of notation), the occupancy measure $q_{t}^{\pi,P}(s',s,a) = Pr[s_{t,h+1}^{\pi}=s', s_{t,h}^{\pi}=s,a_{t,h}^{\pi}=a | \pi,P] \geq 0$ is the probability of visiting the state-action triple $(s',s,a)$ by the algorithm~$\pi$ at layers $h$ and $h+1$ of episode $t$ under the transition function $P$. In order to be feasible, the occupancy measures need to satisfy some conditions at layer $h$ of episode~$t$. First, according to probability theory, they need to satisfy the conditions that,
\begin{align}
q_{t}^{\pi,P}(s,a) = \sum\limits_{s' \in \mathcal{S}_{h+1}} q_{t}^{\pi,P}(s',s,a), \text{ for all } (s,a) \in \mathcal{S}_{h}\times \mathcal{A}, \text{ and }
\sum\limits_{s\in\mathcal{S}_{h}} \sum\limits_{a\in\mathcal{A}} q_{t}^{\pi,P}(s,a) = 1. \label{eq:defineconstraintoccupancymeasure1}
\end{align}
Second, since the probability of transferring to a state $s$ from the previous layer $h-1$ must be equal to the probability of transferring from this state $s$ to the next layer $h+1$, we have
\begin{align}
\sum_{s'\in\mathcal{S}_{h-1}} \sum_{a\in \mathcal{A}} q_{t}^{\pi,P}(s,s',a) = \sum_{s'\in\mathcal{S}_{h+1}} \sum_{a\in \mathcal{A}} q_{t}^{\pi,P}(s',s,a), \text{ for all } s\in \mathcal{S}_{h}. \label{eq:defineconstraintoccupancymeasure2}
\end{align}
Third, the occupancy measure should generate the true transition function $P$, i.e.,
\begin{align}
\frac{q_{t}^{\pi,P}(s',s,a)}{\sum_{s''\in\mathcal{S}_{h+1}} q_{t}^{\pi,P}(s'',s,a)} = P_{h}(s'|s,a), \text{ for all } (s',s,a)\in \mathcal{S}_{h+1} \times \mathcal{S}_{h} \times \mathcal{A}. \label{eq:defineconstraintoccupancymeasure3}
\end{align}
We use $\mathbb{C}(P)$ to denote the set of all occupancy measures that satisfy conditions (\ref{eq:defineconstraintoccupancymeasure1})-(\ref{eq:defineconstraintoccupancymeasure3}). Moreover, at the beginning of each episode $t$, the algorithm $\pi$ associated with the occupancy measure~$q_{t}^{\pi,P}$ chooses a \emph{deterministic} policy $\pi_{t}$ by assigning an action $a\in\mathcal{A}$ to each state $s\in \mathcal{S}$ according to the probability
\begin{align}
Pr[a | s] = \frac{q_{t}^{\pi,P}(s,a)}{\sum_{b\in\mathcal{A}} q_{t}^{\pi,P}(s,b)}. \label{eq:definerelationpolicyoccupancy}
\end{align}
Then, it is not hard to show that the expected total loss, i.e., the first term in (\ref{eq:definetotalonlinecost}), can be expressed as $\text{loss}^{\pi}(1:T) \triangleq \mathbb{E} \left[ \sum_{t=1}^{T} \langle q_{t}^{\pi,P},l_{t} \rangle \Big| \pi,P \right]$. Finally, the regret of an RL algorithm~$\pi$ is defined to be the sum of the loss regret $R_{\text{loss}}^{\pi}(T)$ and the switching costs as follows:
\begin{align}
R^{\pi}(T) \triangleq \underbrace{\max_{q\in \mathbb{C}(P)} \mathbb{E}\left[ \left. \sum_{t=1}^{T} \langle q_{t}^{\pi,P} - q,l_{t} \rangle \right| \pi,P \right]}_{\text{loss regret:}\; R_{\text{loss}}^{\pi}(T)}+\underbrace{ \mathbb{E}\left[ \sum_{t=1}^{T-1} \beta \cdot \mathbf{1}_{\{\pi_{t+1} \neq \pi_{t}\}} \Big| \pi,P \right]}_\text{switching costs}. \label{eq:defineregret}
\end{align}
Therefore, our goal in this paper is to design RL algorithms that achieve as low regret as possible against any possible sequence of loss functions $\left\{l_{t}\right\}_{t=1}^{T}$ and state transition function $P$.
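
As a brief sketch of why the expected total loss can be written through the inner products $\langle q_{t}^{\pi,P}, l_{t} \rangle$ used in (\ref{eq:defineregret}), fix an episode $t$ and treat $l_{t}$ and $q_{t}^{\pi,P}$ as given (i.e., condition on the past; this conditioning is an assumption made only for this illustration). Since the layers are non-intersecting, each state belongs to exactly one layer, and losses are incurred only at layers $h=0,\ldots,H-1$, so that
\begin{align*}
\mathbb{E}\left[ \sum_{h=0}^{H-1} l_{t}(s_{t,h}^{\pi},a_{t,h}^{\pi}) \Big| \pi,P \right]
&= \sum_{h=0}^{H-1} \sum_{s\in\mathcal{S}_{h}} \sum_{a\in\mathcal{A}} Pr[s_{t,h}^{\pi}=s,a_{t,h}^{\pi}=a | \pi,P] \cdot l_{t}(s,a) \\
&= \sum_{h=0}^{H-1} \sum_{s\in\mathcal{S}_{h}} \sum_{a\in\mathcal{A}} q_{t}^{\pi,P}(s,a) \cdot l_{t}(s,a) = \langle q_{t}^{\pi,P}, l_{t} \rangle,
\end{align*}
where the inner product is understood to run over the state-action pairs of layers $0,\ldots,H-1$. Summing over $t=1,\ldots,T$ and taking the outer expectation gives the expression for $\text{loss}^{\pi}(1:T)$.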










\section{A Lower Bound} \label{sec:lowerbound}

In this section, we will develop a lower bound on the regret for adversarial RL with switching costs. Such a lower bound will quantify how difficult it is to control the regret with switching costs under adversarial RL. In Theorem~\ref{theorem:lowerbound} below, we provide this lower bound, the proof of which is given in Appendix~\ref{appendix:proofoftheoremlowerbound}. (In Sec.~\ref{sec:knowntransitionprob} and Sec.~\ref{sec:unknowntransitionprob}, we will provide two near-optimal RL algorithms to achieve this lower bound.)



\begin{theorem}\label{theorem:lowerbound}

For adversarial RL with switching costs and $T\geq\max{\{6H^2 SA,\beta\}}$, the regret of any RL algorithm $\pi$ can be lower-bounded as follows,
\begin{align}
R^{\pi}(T) \geq \tilde{\Omega}\left( \beta^{1/3} \left( H S A \right)^{1/3} T^{2/3} \right). \label{eq:theoremlowerbound}
\end{align}
\end{theorem}



Theorem~\ref{theorem:lowerbound} shows that in adversarial RL with switching costs, the dependency on $T$ of the best achievable regret is at least $\tilde{\Omega}( T^{2/3} )$. Thus, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in {\em static} RL with switching costs (in~\cite{bai2019provably,qiao2022sample}, etc.) as well as adversarial RL \emph{without} switching costs (in~\cite{zimin2013online,jin2018q}, etc.) is no longer achievable. This demonstrates the fundamental challenge of switching costs in adversarial RL, and new challenges can thus be expected when developing provably efficient algorithms.

Further, in Theorem~\ref{theorem:lossswitchingtradeoff} below, we characterize precisely the new trade-off between the loss regret and switching costs defined in (\ref{eq:defineregret}). The proof is provided in Appendix~\ref{appendix:proofoftheoremlossswitchingtradeoff}. Intuitively, by switching more, the online RL algorithm can adapt more flexibly to the new information learned, and thus achieves a lower loss regret. On the other hand, if fewer switches are allowed, the online RL algorithm is less flexible to adapt to the new information learned, which will incur a larger loss regret.



\begin{theorem}\label{theorem:lossswitchingtradeoff}

For adversarial RL with switching costs, with the switching costs equal to $O\left( \beta\cdot \mathcal{N}^{\text{swi}} \right)$, the loss regret can be lower-bounded by $\tilde{\Omega}\left( \sqrt{\frac{HSA}{\mathcal{N}^{\text{swi}}}}\cdot T \right)$. Alternatively, to achieve a loss regret equal to $\tilde{O}\left( \sqrt{\frac{HSA}{\mathcal{N}^{\text{swi}}}}\cdot T \right)$, the switching costs incurred have to be larger than $\Omega\left( \beta\cdot \mathcal{N}^{\text{swi}} \right)$.
\end{theorem}



Theorem~\ref{theorem:lossswitchingtradeoff} provides an interesting and necessary trade-off between the loss regret and switching costs. We further elaborate on this result in three cases. \textbf{First}, in order to achieve a loss regret $\tilde{O} ( H\sqrt{SAT} )$, Theorem~\ref{theorem:lossswitchingtradeoff} shows that the number of switches $\mathcal{N}^{\text{swi}}$ (and thus the switching costs incurred) must be linear in $T$, i.e., essentially switching in almost all episodes. This is consistent with the regret achieved in adversarial RL \emph{without} switching costs, where the policy can be switched a number of times linear in $T$ for free. But our result further implies that, without a number of policy switches that is linear in $T$, it is impossible to achieve an $\tilde{O}( \sqrt{T} )$ loss regret. \textbf{Second}, Theorem~\ref{theorem:lossswitchingtradeoff} shows that, if only a constant or $O(\ln\ln T)$ number of switches are allowed, the loss regret must be linear in~$T$. In contrast, in \emph{static} RL, an $\tilde{O}(\sqrt{T})$ loss regret is achieved with only $O(\ln\ln T)$ switches~\cite{qiao2022sample}. This indicates that the adversarial nature of RL necessarily requires significantly more policy switches to achieve a low loss regret. \textbf{Third}, Theorem~\ref{theorem:lossswitchingtradeoff} suggests that the loss regret and switching costs can be balanced at the order of $\tilde{O}\left( T^{2/3} \right)$. That is, to achieve the $\tilde{O}\left( T^{2/3} \right)$ loss regret, the switching costs incurred have to be $\tilde{\Omega}\left( T^{2/3} \right)$. This is consistent with Theorem~\ref{theorem:lowerbound}, where the regret (including both the loss regret and switching costs) is lower-bounded by $\tilde{\Omega}\left( T^{2/3} \right)$.
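
To see heuristically where the $T^{2/3}$ balance point comes from, one can treat the two bounds in Theorem~\ref{theorem:lossswitchingtradeoff} as if they were exact up to constants and logarithmic factors (an assumption made only for this illustration), and minimize their sum over the number of switches. Writing $N$ for $\mathcal{N}^{\text{swi}}$ and $f$ for the (hypothetical) objective,
\begin{align*}
f(N) \triangleq \beta N + \sqrt{\frac{HSA}{N}} \cdot T, \qquad
f'(N) = \beta - \frac{\sqrt{HSA} \cdot T}{2 N^{3/2}} = 0
\;\Longrightarrow\;
N^{*} = \left( \frac{\sqrt{HSA} \cdot T}{2\beta} \right)^{2/3} = \Theta\left( \beta^{-2/3} (HSA)^{1/3} T^{2/3} \right).
\end{align*}
Substituting $N^{*}$ back, the two terms are of the same order:
\begin{align*}
\beta N^{*} = 2^{-2/3} \beta^{1/3} (HSA)^{1/3} T^{2/3}, \qquad
\sqrt{\frac{HSA}{N^{*}}} \cdot T = 2^{1/3} \beta^{1/3} (HSA)^{1/3} T^{2/3},
\end{align*}
so that $f(N^{*}) = \Theta\left( \beta^{1/3} (HSA)^{1/3} T^{2/3} \right)$, which has the same order as the lower bound in (\ref{eq:theoremlowerbound}).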










\section{The Case when the Transition Function is Known}\label{sec:knowntransitionprob}

In this section, we study the case when the transition function is \emph{known}, and we will further explore the more challenging case when the transition function is \emph{unknown} in Sec.~\ref{sec:unknowntransitionprob}. We propose a novel algorithm (please see Algorithm~\ref{alg:ereps}) with a regret that matches the lower bound in (\ref{eq:theoremlowerbound}). Our algorithm is called Switching rEduced EpisoDic relative entropy policy Search (\text{SEEDS}).

\text{SEEDS}~is inspired by the episodic method in bandit learning~\cite{shi2022power}. In bandit learning, the idea is to divide the time horizon into $\Theta(T^{2/3})$ episodes, and pull one \emph{single} Exp3-arm in an episode. By doing so, the total switching cost is trivially $O(T^{2/3})$. Meanwhile, the loss regret in an episode is $\Theta(\eta \cdot (T^{1/3})^2)$, which is proportional to the loss variance in an episode. The final $O(T^{2/3})$ regret is then achieved by taking the sum of all these costs and tuning the parameter $\eta=\Theta(T^{-2/3})$. However, in the adversarial MDP setting that we consider, there is a key difference: state-action visitations are random, which causes several new challenges, as we discuss in the rest of this section.
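
The arithmetic behind the $T^{2/3}$ rate in the bandit setting can be sketched as follows, tracking only the dependency on $T$. We assume here (this term is not stated above) that the Exp3-type analysis also contributes the usual $O(1/\eta)$ term, up to logarithmic factors in the number of arms. With $\Theta(T^{2/3})$ episodes of length $\Theta(T^{1/3})$ each,
\begin{align*}
\underbrace{O\left( \frac{1}{\eta} \right)}_{\text{assumed}}
+ \underbrace{\Theta(T^{2/3}) \cdot \Theta\left( \eta \cdot (T^{1/3})^{2} \right)}_{\text{loss regret over all episodes}}
+ \underbrace{O(T^{2/3})}_{\text{switching costs}}
= O\left( \frac{1}{\eta} + \eta \cdot T^{4/3} + T^{2/3} \right).
\end{align*}
The choice $\eta = \Theta(T^{-2/3})$ equalizes the first two terms at $\Theta(T^{2/3})$, so that the total is $O(T^{2/3})$.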



\begin{algorithm}[t]
\caption{Switching rEduced EpisoDic relative entropy policy Search (\text{SEEDS})}
\begin{algorithmic}
\STATE \textbf{Parameters:} $\eta = \tilde{\Theta}\left( \beta^{-1/3} H^{2/3} (SA)^{-1/3} T^{-2/3} \right)$ and $\tau = \tilde{\Theta}\left( \beta^{2/3} (HSA)^{-1/3} T^{1/3} \right)$.
\STATE \textbf{Initialization:} $Pr[a|s]= \frac{1}{A}$ for all $(s,a) \in \mathcal{S}\times \mathcal{A}$. Choose $\pi_{[1]}^{\text{SEEDS}}$ according to (\ref{eq:definerelationpolicyoccupancy}).
\FOR{$u = 1 : \left\lceil \frac{T}{\tau} \right\rceil$}
\FOR{$t = (u-1)\tau+1 : \min\{u \tau, T\}$}
\STATE \textit{Step 1:} Execute the updated policy $\pi_{[u]}^{\text{SEEDS}} = \pi^{\hat{q}_{[u]}^{\text{SEEDS},P}}$.
\ENDFOR
\STATE At the end of super-episode $u$,
\STATE \textit{Step 2:} Estimate the losses $\hat{l}_{[u]}^{\text{SEEDS}}(s,a)$ for all $(s,a)$ according to (\ref{eq:knowntpupdateloss}).
\STATE \textit{Step 3:} Update the occupancy measure $\hat{q}_{[u+1]}^{\text{SEEDS},P}(s,a)$ according to (\ref{eq:knowntpupdateom}). Update the deterministic policy $\pi^{\hat{q}_{[u+1]}^{\text{SEEDS},P}}$ according to (\ref{eq:definerelationpolicyoccupancy}).
\ENDFOR
\end{algorithmic}
\label{alg:ereps}
\end{algorithm}



\textbf{Super-episode-based policy search:} \text{SEEDS}~divides the episodes into $\mathcal{U} = \left\lceil \frac{T}{\tau} \right\rceil$ super-episodes, where $\tau\in \mathbb{Z}_{++}$ is a tunable parameter and a strictly positive integer. Each super-episode includes $\tau$ consecutive episodes (except possibly the last one, which ends at episode $T$). For all episodes in each super-episode $u=1,\ldots,\mathcal{U}$, \text{SEEDS}~uses the same policy $\pi^{\hat{q}_{[u]}^{\text{SEEDS},P}}$ (\textit{Step-1} in Algorithm~\ref{alg:ereps}) that was updated at the end of the last super-episode $u-1$, where $\hat{q}_{[u]}^{\text{SEEDS},P}$ is the updated occupancy measure (that we will introduce soon) of \text{SEEDS}~for super-episode $u$. Thus, \text{SEEDS}~switches the policy at most once in each super-episode.
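
As a quick sanity check on the resulting switching costs (a sketch that ignores the logarithmic factors hidden in $\tau$ and the rounding of $\tau$ to an integer): since the policy can change only at the boundary between two consecutive super-episodes, the number of switches is at most $\mathcal{U}-1 < \frac{T}{\tau}$. Hence, with $\tau$ of the order given in Algorithm~\ref{alg:ereps},
\begin{align*}
\sum_{t=1}^{T-1} \beta \cdot \mathbf{1}_{\{\pi_{t+1} \neq \pi_{t}\}}
\leq \beta \cdot \left( \left\lceil \frac{T}{\tau} \right\rceil - 1 \right)
< \frac{\beta T}{\tau}
= \frac{\beta T}{\beta^{2/3} (HSA)^{-1/3} T^{1/3}}
= \beta^{1/3} (HSA)^{1/3} T^{2/3},
\end{align*}
which is of the same order as the lower bound in (\ref{eq:theoremlowerbound}).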

\textbf{A novel idea for estimating the losses:} At the end of super-episode $u$, \text{SEEDS}~estimates the losses $l_{[u]}(s,a)$ of all state-action pairs in super-episode $u$. Here, it is instructive to see why the episodic importance-estimating method in adversarial bandit learning (i.e., without state transitions) does not apply to our problem. Note that due to state transitions in our more general MDP setting, we are not guaranteed to visit a \emph{single} state-action pair for the whole super-episode. A naive but intuitive solution may be to pretend that each state-action pair visited in super-episode $u$ was the \emph{single} one visited. Then, we could let the estimated loss of each state-action pair $(s,a)$ be $\hat{l}_{[u]}(s,a) = \frac{\bar{l}_{[u]}(s,a)}{1-(1-\hat{q}_{[u]}^{\text{SEEDS},P}(s,a))^{\tau}} \mathbf{1}_{\{(s,a)\text{ was visited in super-episode }u\}}$, where the numerator $\bar{l}_{[u]}(s,a) = \sum_{t=(u-1)\tau+1}^{u\tau} l_{t}(s,a) / \tau$ is the average loss of $(s,a)$. If the loss $l_{t}$ were the same for all episodes $t$ in super-episode~$u$, then, according to the analysis in bandit learning and the inequality $1-(1-x)^{\tau} \geq x$ for all $0\leq x\leq 1$, this idea would work. However, the problem is that, inside super-episode $u$, the loss function $l_{t}$ for each episode $t$ could change arbitrarily. Since the learner observes $l_{t}(s,a)$ only in the episodes in which $(s,a)$ is visited, the average $\bar{l}_{[u]}(s,a)$, and thus the estimated loss $\hat{l}_{[u]}(s,a)$ above, is actually unknown to the learner and cannot be computed.

To resolve the aforementioned difficulty due to randomly-visited state-action pairs and arbitrarily-changing loss functions, \text{SEEDS}~estimates the loss as follows (\textit{Step-2} in Algorithm~\ref{alg:ereps}),
\begin{equation}
\hat{l}_{[u]}^{\text{SEEDS}}(s,a) = \sum\limits_{j=1}^{J_{[\uu]}} \frac{l_{t_{j}(s,a)}(s,a)}{\hat{q}_{[u]}^{\text{SEEDS},P}(s,a)} \mathbf{1}_{\{(s,a)\text{ was visited in episodes }t_{1}(s,a), ..., t_{J_{[\uu]}}(s,a)\text{ of super-episode }u\}}, \label{eq:knowntpupdateloss}
\end{equation}
where $J_{[\uu]}$ is the total number of episodes in which the state-action pair $(s,a)$ was visited in super-episode $u$. In other words, in super-episode $u$, this state-action pair $(s,a)$ was not visited in any other episode $t\in \{(u-1)\tau+1,...,u\tau\} \setminus \{t_{1}(s,a), ..., t_{J_{[\uu]}}(s,a)\}$. Thus, \text{SEEDS}~estimates the losses based on the observable true losses in super-episode $u$. In this way, \text{SEEDS}~elegantly resolves the aforementioned difficulty due to the random state transitions and adversarial losses.
Our novel idea in (\ref{eq:knowntpupdateloss}) may be of independent interest for other problems with state transitions and non-stationary or adversarial losses. Indeed, in Sec.~\ref{sec:unknowntransitionprob}, we will apply this idea to the case when the transition function is unknown.

In Lemma~\ref{lemma:knownexpectedlossunbias} below, we show that the estimated loss in (\ref{eq:knowntpupdateloss}) is an unbiased estimate of the true loss in super-episode $u$. This is an important property that we will exploit in our regret analysis. The proof of Lemma~\ref{lemma:knownexpectedlossunbias} is provided in Appendix~\ref{appendix:proofoflemmaknownexpectedlossunbias}. We use $\mathcal{F}_{[u]}$ to denote the $\sigma$-algebra generated by the observations of \text{SEEDS}~before super-episode $u$.



\begin{lemma}\label{lemma:knownexpectedlossunbias}

The conditional expectation of the estimated loss designed in (\ref{eq:knowntpupdateloss}) is equal to
\begin{align}
\mathbb{E} \left[ \hat{l}_{[u]}^{\text{SEEDS}}(s,a) \Big| \mathcal{F}_{[u]} \right] = l_{[u]}(s,a), \text{ for all } (s,a), \label{eq:knownexpectedlossunbias}
\end{align}
where the expectation is taken with respect to the randomness of the episodes $t_{1}(s,a)$, ..., $t_{J_{[\uu]}}(s,a)$, in which the state-action pair $(s,a)$ was visited, and $l_{[u]}(s,a) = \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_{t}(s,a)$ is the true loss of $(s,a)$ in super-episode~$u$.

\end{lemma}
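
The intuition behind Lemma~\ref{lemma:knownexpectedlossunbias} can be sketched as follows; this is only an informal outline, and the complete argument is the one in Appendix~\ref{appendix:proofoflemmaknownexpectedlossunbias}. The estimator in (\ref{eq:knowntpupdateloss}) can be rewritten as a sum over all episodes of super-episode $u$, with an indicator of whether $(s,a)$ was visited in each of them. Assume that, conditioned on $\mathcal{F}_{[u]}$, the pair $(s,a)$ is visited in each episode $t$ of super-episode $u$ with probability $\hat{q}_{[u]}^{\text{SEEDS},P}(s,a) > 0$ (which is what the known transition function and the rule in (\ref{eq:definerelationpolicyoccupancy}) are meant to ensure), and that $l_{t}(s,a)$ does not depend on the learner's randomness within super-episode $u$. Then, by linearity of expectation,
\begin{align*}
\mathbb{E} \left[ \hat{l}_{[u]}^{\text{SEEDS}}(s,a) \Big| \mathcal{F}_{[u]} \right]
&= \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} \frac{l_{t}(s,a)}{\hat{q}_{[u]}^{\text{SEEDS},P}(s,a)} \cdot Pr\left[ (s,a) \text{ was visited in episode } t \Big| \mathcal{F}_{[u]} \right] \\
&= \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} \frac{l_{t}(s,a)}{\hat{q}_{[u]}^{\text{SEEDS},P}(s,a)} \cdot \hat{q}_{[u]}^{\text{SEEDS},P}(s,a)
= l_{[u]}(s,a).
\end{align*}
Note that linearity of expectation does not require the visits in different episodes to be independent, which matters here because the same policy is reused throughout the super-episode.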



\textbf{Updating the occupancy measure:} Finally, according to online mirror descent~\cite{rakhlin2009lecture,zimin2013online}, \text{SEEDS}~updates the occupancy measure $\hat{q}_{[u+1]}^{\text{SEEDS},P}(s,a)$ for all state-action pairs $(s,a) \in \mathcal{S}\times \mathcal{A}$ as follows (\textit{Step-3} in Algorithm~\ref{alg:ereps}),
\begin{align}
\hat{q}_{[u+1]}^{\text{SEEDS},P} = \argmin_{q\in \mathbb{C}(P)} \left\{ \eta \cdot \left\langle q, \hat{l}_{[u]}^{\text{SEEDS}} \right\rangle + D_{\text{KL}}\left( q \left\| \hat{q}_{[u]}^{\text{SEEDS},P} \right. \right) \right\}, \label{eq:knowntpupdateom}
\end{align}
where $D_{\text{KL}}(q\|q') \triangleq \sum\limits_{s\in\mathcal{S},a\in\mathcal{A}} q(s,a) \ln \frac{q(s,a)}{q'(s,a)} - \sum\limits_{s\in\mathcal{S},a\in\mathcal{A}} \left[ q(s,a) - q'(s,a) \right]$ is the unnormalized relative entropy between two occupancy measures $q$ and $q'$ on the space $\mathcal{S} \times \mathcal{A}$. Recall that $\mathbb{C}(P)$ is defined by (\ref{eq:defineconstraintoccupancymeasure1})--(\ref{eq:defineconstraintoccupancymeasure3}). Note that the term $\langle q, \hat{l}_{[u]}^{\text{SEEDS}} \rangle$ represents the expected loss in super-episode $u$, with respect to the newly estimated loss function $\hat{l}_{[u]}^{\text{SEEDS}}$. Thus, it captures how \text{SEEDS}~adapts to and explores the newly estimated loss function. In addition, the term $D_{\text{KL}}(q\|\hat{q}_{[u]}^{\text{SEEDS},P})$ serves as a regularizer to ensure that the updated occupancy measure in (\ref{eq:knowntpupdateom}) stays close to $\hat{q}_{[u]}^{\text{SEEDS},P}$. Thus, it captures how \text{SEEDS}~exploits the previously estimated loss functions before super-episode~$u$. As a result, by tuning the parameter $\eta$ in (\ref{eq:knowntpupdateom}), the updated occupancy measure strikes a balance between exploration and exploitation.
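
As an illustrative aside, the update in (\ref{eq:knowntpupdateom}) can be read as an unconstrained multiplicative step followed by a projection onto $\mathbb{C}(P)$. This is a standard property of online mirror descent with the unnormalized relative entropy, not something specific to \text{SEEDS}, and the intermediate vector $\tilde{q}_{[u+1]}$ below is notation introduced only for this sketch. Let $\tilde{q}_{[u+1]}(s,a) \triangleq \hat{q}_{[u]}^{\text{SEEDS},P}(s,a) \exp\left( -\eta \hat{l}_{[u]}^{\text{SEEDS}}(s,a) \right)$. Then, for any $q$,
\begin{align*}
D_{\text{KL}}\left( q \left\| \tilde{q}_{[u+1]} \right. \right) &= \sum\limits_{s,a} q(s,a) \ln \frac{q(s,a)}{\hat{q}_{[u]}^{\text{SEEDS},P}(s,a)} + \eta \sum\limits_{s,a} q(s,a) \hat{l}_{[u]}^{\text{SEEDS}}(s,a) - \sum\limits_{s,a} \left[ q(s,a) - \tilde{q}_{[u+1]}(s,a) \right] \\
&= \eta \cdot \left\langle q, \hat{l}_{[u]}^{\text{SEEDS}} \right\rangle + D_{\text{KL}}\left( q \left\| \hat{q}_{[u]}^{\text{SEEDS},P} \right. \right) + \sum\limits_{s,a} \left[ \tilde{q}_{[u+1]}(s,a) - \hat{q}_{[u]}^{\text{SEEDS},P}(s,a) \right].
\end{align*}
The last sum does not depend on $q$, so the minimizer in (\ref{eq:knowntpupdateom}) coincides with $\argmin_{q\in \mathbb{C}(P)} D_{\text{KL}}( q \| \tilde{q}_{[u+1]} )$. In this view, state-action pairs with a large estimated loss are down-weighted exponentially at a rate governed by $\eta$ before the projection restores feasibility.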

We characterize the regret of \text{SEEDS}~in Theorem~\ref{thm:knowntranprobregret} below.



\begin{theorem}\label{thm:knowntranprobregret}

Consider adversarial RL with switching costs introduced in Sec.~\ref{sec:problemformulation}. When the transition function $P$ is known, the regret of \text{SEEDS}~is upper-bounded as follows,
\begin{align}
R^{\text{SEEDS}}(T) \leq \tilde{O}\left( \beta^{1/3} \left( H S A \right)^{1/3} T^{2/3} \right). \label{eq:knowntranprobregret}
\end{align}

\end{theorem}



Theorem~\ref{thm:knowntranprobregret} shows that the regret of \text{SEEDS}~matches the lower bound in (\ref{eq:theoremlowerbound}) in terms of the dependency on all the parameters $T$, $S$, $A$, $H$ and $\beta$. Thus, the regret of \text{SEEDS}~is order-wise optimal. \emph{To the best of our knowledge, this is the first regret result for adversarial RL with switching costs.} To prove Theorem~\ref{thm:knowntranprobregret}, the main difficulty lies in capturing the effects of the arbitrarily-changing losses and multiple random visitations of each state-action pair in a super-episode. To overcome this difficulty, our new idea is to first upper-bound the loss regret based on the correlated loss feedback in a super-episode, and then relate these upper bounds across all super-episodes to the final regret. The first step relies on the proof of Lemma~\ref{lemma:knownexpectedlossunbias}, and the second step relies on another lemma in Appendix~\ref{subsec:proofofthmknowntranprobregret} that transfers the original regret formulation to a form based on the losses from the entire super-episode. Please see Appendix~\ref{appendix:proofofthmknowntranprobregret} for details and the proof of Theorem~\ref{thm:knowntranprobregret}.


Further, in Theorem~\ref{thm:seedstradeofflossswitch} below, we show that \text{SEEDS}~attains a trade-off between the loss regret and switching costs that matches the trade-off in Theorem~\ref{theorem:lossswitchingtradeoff}. The proof of Theorem~\ref{thm:seedstradeofflossswitch} follows from the loss-regret bound of \text{SEEDS}~proved in Appendix~\ref{appendix:proofofthmknowntranprobregret} and the trivial switching-cost bound $\beta \cdot \left\lceil \frac{T}{\tau} \right\rceil$. Please see the end of Appendix~\ref{appendix:proofofthmknowntranprobregret} for details.



\begin{theorem}\label{thm:seedstradeofflossswitch}

Let $\mathcal{N}^{\text{SEEDS}} \triangleq \left\lceil \frac{T}{\tau} \right\rceil$. Then, with the switching costs equal to $O\left( \beta\cdot \mathcal{N}^{\text{SEEDS}} \right)$, \text{SEEDS}~can achieve a loss regret upper-bounded by $\tilde{O}\left( \sqrt{\frac{HSA}{\mathcal{N}^{\text{SEEDS}}}}\cdot T \right)$.

\end{theorem}
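
To see how the trade-off in Theorem~\ref{thm:seedstradeofflossswitch} relates to the rate in Theorem~\ref{thm:knowntranprobregret}, the following back-of-the-envelope calculation may help. It is only a sketch: it ignores constants, logarithmic factors, and the ceiling in $\mathcal{N}^{\text{SEEDS}}$, so that $\mathcal{N}^{\text{SEEDS}} \approx T/\tau$. Under these simplifications, the sum of the loss regret and the switching costs behaves like
\begin{align*}
\sqrt{\frac{HSA}{T/\tau}} \cdot T + \beta \cdot \frac{T}{\tau} = \sqrt{HSA \cdot \tau \cdot T} + \frac{\beta T}{\tau}.
\end{align*}
The two terms are of the same order when $\tau^{3/2} = \beta \left( HSA \right)^{-1/2} T^{1/2}$, i.e., $\tau = \beta^{2/3} \left( HSA \right)^{-1/3} T^{1/3}$. With this choice, both terms are of order $\beta^{1/3} \left( HSA \right)^{1/3} T^{2/3}$, which is consistent with (\ref{eq:knowntranprobregret}).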










\section{The Case when the Transition Function is Unknown}\label{sec:unknowntransitionprob}



\begin{algorithm}[t]
\caption{\text{SEEDS}-Unknown Transition (\text{SEEDS-UT})}
\begin{algorithmic}
\STATE \textbf{Parameters:} $\eta = \tilde{\Theta}\left( \beta^{-1/3} H^{1/3} (SA)^{-1/3} T^{-2/3} \right)$, $\tau = \tilde{\Theta}\left( \beta^{2/3} H^{-2/3} (SA)^{-1/3} T^{1/3} \right)$, $\gamma = \tilde{\Theta}\left( \beta^{1/3} H^{2/3} (SA)^{-2/3} T^{-1/2} \right)$, and $0< \delta < 1$.
\STATE \textbf{Initialization:} $\hat{q}_{[1]}^{\text{SEEDS-UT},\mathcal{P}}(s',s,a) = \frac{1}{S_{h+1} S_{h} A}$ and $M_{[1]}(s',s,a) = N_{[1]}(s,a) = 0$, for all $(s',s,a) \in \mathcal{S}_{h+1} \times \mathcal{S}_{h} \times \mathcal{A}$ and all $h$. $\mathcal{P}_{[1]}$ contains all possible transition functions. Choose $\pi_{[1]}^{\text{SEEDS-UT}} = \pi^{\hat{q}_{[1]}^{\text{SEEDS-UT},\mathcal{P}}}$ according to (\ref{eq:defineconstraintoccupancymeasure1}) and (\ref{eq:definerelationpolicyoccupancy}).
\FOR{$u = 1 : \left\lceil \frac{T}{\tau} \right\rceil$}
\FOR{$t = (u-1)\tau+1 : \min\{u \tau, T\}$}
\STATE \textit{Step 1:} Execute the updated policy $\pi_{[u]}^{\text{SEEDS-UT}} = \pi^{\hat{q}_{[u]}^{\text{SEEDS-UT},\mathcal{P}}}$.
\ENDFOR
\STATE At the end of super-episode $u$,
\STATE \textit{Step 2:} Estimate the losses $\hat{l}_{[u]}^{\text{SEEDS-UT}}(s,a)$ for all $(s,a)$ according to (\ref{eq:unknowntpupdateloss}).
\STATE \textit{Step 3:} Estimate the transition-function set $\mathcal{P}_{[u+1]}$ according to (\ref{eq:unknowntpupdatetransition}).
\STATE \textit{Step 4:} Update the occupancy measure $\hat{q}_{[u+1]}^{\text{SEEDS-UT},\mathcal{P}}(s',s,a)$ according to (\ref{eq:knowntpupdateom}), but subject to a different constraint $q\in \mathbb{C}\left(\mathcal{P}_{[u+1]}\right)$. Update the deterministic policy $\pi^{\hat{q}_{[u+1]}^{\text{SEEDS-UT},\mathcal{P}}}$ according to (\ref{eq:defineconstraintoccupancymeasure1}) and (\ref{eq:definerelationpolicyoccupancy}).
\ENDFOR
\end{algorithmic}
\label{alg:erepss}
\end{algorithm}



In this section, we study a more challenging case when the transition function is \emph{unknown}. We propose a novel algorithm (please see Algorithm~\ref{alg:erepss}) with a regret that matches the lower bound in (\ref{eq:theoremlowerbound}) in terms of the dependency on all parameters, up to a small factor of $\tilde{O}(H^{1/3})$. Specifically, to address the new difficulty due to the \emph{unknown} transition function $P$ in this case, we advance SEEDS into SEEDS-UT (where UT stands for ``unknown transition'') with three new components as we explain below.

1. Since the transition function $P$ is unknown, updating the occupancy measure $\hat{q}(s,a)$ (as in \text{SEEDS}) is not good enough. Instead, \text{SEEDS-UT}~updates the occupancy measure $\hat{q}(s',s,a)$ to take state transitions into consideration.

2. Since the transition function $P$ is unknown, the updated occupancy measure could be different from the true one. To resolve this issue, we generalize the method in~\cite{neu2015explore}, with a key difference that handles the random sequence of the state-action pairs visited in each super-episode. Specifically, \text{SEEDS-UT}~estimates the loss for each super-episode $u$ as follows (\textit{Step-2} in Algorithm~\ref{alg:erepss}),
\begin{align}
\hat{l}_{[u]}^{\text{SEEDS-UT}}(s,a) = \sum\limits_{j=1}^{J_{[\uu]}} \frac{l_{t_{j}(s,a)}(s,a)}{\mathcal{Q}_{[u]}^{\gamma}(s,a)} \mathbf{1}_{\{(s,a)\text{ was visited in episodes }t_{1}(s,a), ..., t_{J_{[\uu]}}(s,a)\text{ of super-episode }u\}}, \label{eq:unknowntpupdateloss}
\end{align}
where $\mathcal{Q}_{[u]}^{\gamma}(s,a) \triangleq \max_{q\in \mathbb{C}(\mathcal{P}_{[u]})} q(s,a) + \gamma$ is the sum of the largest probability of visiting $(s,a)$ among all occupancy measures in $\mathbb{C}(\mathcal{P}_{[u]})$ and a tunable parameter $\gamma>0$, and $\mathcal{P}_{[u]}$ is a transition-function set that we will introduce soon. Note that (\ref{eq:unknowntpupdateloss}) is another application of our idea in (\ref{eq:knowntpupdateloss}) for estimating losses in a problem with state transitions and adversarial losses.

In Lemma~\ref{lemma:unknownexpectedlossunbias} below, we show that the gap between the expectation of the estimated loss and the true loss is controlled by the parameter $\gamma$. The proof of Lemma~\ref{lemma:unknownexpectedlossunbias} is provided in Appendix~\ref{appendix:proofoflemmaunknownexpectedlossunbias}. We use $\mathcal{F}_{[u]}$ to denote the $\sigma$-algebra generated by the observations of \text{SEEDS-UT}~before super-episode $u$.



\begin{lemma}\label{lemma:unknownexpectedlossunbias}

The conditional expectation of the estimated loss designed in (\ref{eq:unknowntpupdateloss}) satisfies
\begin{align}
\mathbb{E} \left[ \left. \hat{l}_{[u]}^{\text{SEEDS-UT}}(s,a) \right| \mathcal{F}_{[u]} \right] = \frac{q_{[u]}^{\text{SEEDS-UT},P}(s,a)}{\max_{q\in \mathbb{C}(\mathcal{P}_{[u]})} q(s,a) + \gamma} \cdot l_{[u]}(s,a), \text{ for all } (s,a), \label{eq:unknownexpectedlossunbias}
\end{align}
where the expectation is taken with respect to the randomness of the episodes $t_{1}(s,a)$, ..., $t_{J_{[\uu]}}(s,a)$, in which $(s,a)$ was visited, $q_{[u]}^{\text{SEEDS-UT},P}(s,a)$ is the true occupancy measure of \text{SEEDS-UT}~conditioned on $\mathcal{F}_{[u]}$, and $l_{[u]}(s,a) = \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_{t}(s,a)$ is the true loss of $(s,a)$ in super-episode~$u$.

\end{lemma}
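
The identity in (\ref{eq:unknownexpectedlossunbias}) can be motivated by a short calculation. The following is only a sketch, under the assumptions that the losses $l_{t}$ and the quantity $\mathcal{Q}_{[u]}^{\gamma}(s,a)$ are fixed given $\mathcal{F}_{[u]}$ and that each episode visits $(s,a)$ at most once; the complete argument is in Appendix~\ref{appendix:proofoflemmaunknownexpectedlossunbias}. Rewriting the sum over the visited episodes in (\ref{eq:unknowntpupdateloss}) as a sum over all episodes of super-episode $u$, we get
\begin{align*}
\mathbb{E} \left[ \left. \hat{l}_{[u]}^{\text{SEEDS-UT}}(s,a) \right| \mathcal{F}_{[u]} \right] &= \frac{1}{\mathcal{Q}_{[u]}^{\gamma}(s,a)} \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_{t}(s,a) \cdot \Pr\left[ \left. (s,a)\text{ is visited in episode } t \right| \mathcal{F}_{[u]} \right] \\
&= \frac{q_{[u]}^{\text{SEEDS-UT},P}(s,a)}{\mathcal{Q}_{[u]}^{\gamma}(s,a)} \sum_{t=(u-1)\tau+1}^{\min\{u\tau,T\}} l_{t}(s,a),
\end{align*}
where the second equality uses the fact that the same policy $\pi_{[u]}^{\text{SEEDS-UT}}$ is executed in every episode of super-episode $u$, so the visitation probability equals $q_{[u]}^{\text{SEEDS-UT},P}(s,a)$ in each of them. In particular, if the true transition function $P$ belongs to $\mathcal{P}_{[u]}$, then $q_{[u]}^{\text{SEEDS-UT},P}(s,a) \leq \max_{q\in \mathbb{C}(\mathcal{P}_{[u]})} q(s,a)$, so the ratio is at most $1$ and the estimator does not overestimate the true loss in expectation.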



Lemma~\ref{lemma:unknownexpectedlossunbias} shows that, as long as $\mathcal{P}_{[u]}$ is sufficiently good for estimating the true transition function $P$ (we will show how to construct such a $\mathcal{P}_{[u]}$ below), by carefully tuning $\gamma$, the bias caused by $ \max_{q\in \mathbb{C}(\mathcal{P}_{[u]})} q(s,a) + \gamma$ (i.e., $\mathcal{Q}_{[u]}^{\gamma}(s,a)$) should be sufficiently small, so that the estimated loss is still sufficiently accurate.



3. Since the transition function $P$ is unknown, the constraint in (\ref{eq:knowntpupdateom}) is no longer known. To resolve this issue, we generalize the method in~\cite{jin2020learning}, with a difference that handles the samples from the whole super-episode. Specifically, at the end of each super-episode, \text{SEEDS-UT}~collects the samples from the whole super-episode to update the empirical transition probability $\bar{P}_{[u+1]}(s'|s,a) = \frac{M_{[u+1]}(s',s,a)}{\max\left\{N_{[u+1]}(s,a),1\right\}}$, where $M_{[u+1]}(s',s,a)$ and $N_{[u+1]}(s,a)$ denote the numbers of visits to $(s',s,a)$ and $(s,a)$ before super-episode $u+1$, respectively. Then, based on the empirical Bernstein bound~\cite{maurer2009empirical}, \text{SEEDS-UT}~constructs a transition-function set $\mathcal{P}_{[u+1]}$ as follows (\textit{Step-3} in Algorithm~\ref{alg:erepss}),
\begin{align}
\mathcal{P}_{[u+1]} = \left\{ \hat{P}_{[u+1]}: \left| \hat{P}_{[u+1]}(s'|s,a) - \bar{P}_{[u+1]}(s'|s,a) \right| \leq \epsilon_{[u+1]}(s',s,a), \text{ for all } (s',s,a) \right\}, \label{eq:unknowntpupdatetransition}
\end{align}
where $\epsilon_{[u+1]}(s',s,a) = 2\sqrt{\frac{\bar{P}_{[u+1]}(s'|s,a)\ln\frac{TSA}{\delta}}{\max\left\{N_{[u+1]}(s,a)-1,1\right\}}} + \frac{14\ln\frac{TSA}{\delta}}{3\max\left\{N_{[u+1]}(s,a)-1,1\right\}}$, and $\delta \in (0,1)$ is the confidence parameter. Finally, the occupancy measure $\hat{q}_{[u+1]}^{\text{SEEDS-UT},\mathcal{P}}(s',s,a)$ is updated according to (\ref{eq:knowntpupdateom}), but subject to a different constraint $q\in \mathbb{C}\left(\mathcal{P}_{[u+1]}\right)$ (\textit{Step-4} in Algorithm~\ref{alg:erepss}).
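
As a quick sanity check on the confidence radius (a sketch that uses only the formula above), consider a state-action pair $(s,a)$ that has not been visited yet, so that $N_{[u+1]}(s,a) = 0$ and the empirical transition probability is $0$ for every $s'$. Then,
\begin{align*}
\epsilon_{[u+1]}(s',s,a) = 2\sqrt{\frac{0 \cdot \ln\frac{TSA}{\delta}}{1}} + \frac{14\ln\frac{TSA}{\delta}}{3 \cdot 1} = \frac{14}{3}\ln\frac{TSA}{\delta},
\end{align*}
which is at least $1$ whenever $\frac{TSA}{\delta} \geq e^{3/14}$. In that regime, the constraint in (\ref{eq:unknowntpupdatetransition}) is vacuous for this pair, which is consistent with the initialization where $\mathcal{P}_{[1]}$ contains all possible transition functions. As $N_{[u+1]}(s,a)$ grows, the first term shrinks at the rate $1/\sqrt{N_{[u+1]}(s,a)}$ and the second at the rate $1/N_{[u+1]}(s,a)$, so the set tightens around the empirical estimate.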

We characterize the regret of \text{SEEDS-UT}~in Theorem~\ref{thm:unknowntranprobregret} below.



\begin{theorem}\label{thm:unknowntranprobregret}

Consider adversarial RL with switching costs introduced in Sec.~\ref{sec:problemformulation}. When the transition function $P$ is unknown, with probability at least $1-\delta$, the regret of \text{SEEDS-UT}~is upper-bounded as follows,
\begin{align}
R^{\text{SEEDS-UT}}(T) \leq \tilde{O}\left( \beta^{1/3} H^{2/3} \left( S A \right)^{1/3} T^{2/3} \left(\ln\frac{TSA}{\delta}\right)^{1/2} \right). \label{eq:unknowntranprobregret}
\end{align}

\end{theorem}



Theorem~\ref{thm:unknowntranprobregret} shows that the regret of \text{SEEDS-UT}~matches the lower bound in (\ref{eq:theoremlowerbound}) in terms of the dependency on $T$, $S$, $A$, and $\beta$, up to a small factor of $\tilde{O}(H^{1/3})$. That is, the regret of \text{SEEDS-UT}~is near-optimal. \emph{To the best of our knowledge, this is the first regret result for adversarial RL with switching costs when the transition function is unknown.} To prove Theorem~\ref{thm:unknowntranprobregret}, the main difficulty is that, due to the delayed switching and unknown transition function, the losses of \text{SEEDS-UT}~in the episodes of any super-episode are highly correlated and the true occupancy measure is unknown. As a result, the existing analytical ideas in adversarial RL without switching costs and adversarial bandit learning with switching costs do not work here. To overcome these new difficulties, our analysis involves several new ideas, e.g., we construct a series in (\ref{eq:sketchproofunknownregretdecompose2ii8}) to handle multiple random visitations of each state-action pair, and we establish a super-episodic version of concentration in \textit{Step-2-iii} of Appendix~\ref{appendix:proofofthmunknowntranprobregret} by relating the second-order moment of the estimated loss that we design to the true loss and the length $\tau$ of a super-episode. Please see Appendix~\ref{appendix:proofofthmunknowntranprobregret} for the detailed proof of Theorem~\ref{thm:unknowntranprobregret}.










\section{Conclusion and Future Work}\label{sec:conclusion}

In this paper, we make the first effort towards addressing the challenge of switching costs in adversarial RL. First, we provide a lower bound that shows that the best achieved regret in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. In addition, we characterize precisely the new trade-off between the loss regret and switching costs, which shows that the adversarial nature of RL necessarily requires more switches to achieve a low loss regret. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}(H^{1/3})$ when the transition function is unknown.

Several future directions are worth pursuing. First, it is important to study adversarial RL with switching costs in linear and more general MDP settings. Another interesting direction is to extend our study to the dynamic regret, which allows the optimal policy to change over time.









