% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{subfigure}
\input{defpack_uai}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Optimistic Regret Bounds for Online Learning in \\Adversarial Markov Decision Processes}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<moon182@purdue.edu>?Subject=Your UAI 2024 paper}{Sang Bin Moon}{}}
\author[1]{Sang Bin Moon}
\author[1]{Abolfazl Hashemi}
% Add affiliations after the authors
\affil[1]{%
    School of Electrical and Computer Engineering\\
    Purdue University\\
    West Lafayette, Indiana, USA
}
  
  \begin{document}
\maketitle

\begin{abstract}
    The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems.
    A major limitation of the AMDP formalism, however, is that its regret analysis is pessimistic: although the cost function can change from one episode to the next, in many settings its evolution is not adversarial.
    To address this, we introduce and study a new variant of AMDP, which aims to minimize regret while utilizing a set of cost predictors.
    For this setting, we develop a new policy search method that achieves a sublinear optimistic regret with high probability, that is, a regret bound that degrades gracefully with the estimation power of the cost predictors.
    Establishing such optimistic regret bounds is nontrivial given that (i) as we demonstrate, the existing importance-weighted cost estimators cannot establish optimistic bounds, and (ii) the feedback model of AMDP is different from (and more realistic than) that of existing optimistic online learning works.
    Our result, in particular, hinges upon developing a novel optimistically biased cost estimator that leverages cost predictors and enables a high-probability regret analysis without imposing restrictive assumptions.
    We further discuss practical extensions of the proposed scheme and demonstrate its efficacy numerically.
\end{abstract}

\section{Introduction}
% MDP and the motivation for AMDP
Reinforcement learning studies the problem of sequential decision-making modeled as a Markov Decision Process (MDP), where a learner interacts with an environment and solves for the optimal policy that minimizes the cumulative cost incurred by the environment. The learner interacts with the environment by observing a state, choosing an action, and suffering a cost, repeatedly for a finite number of time steps. The process is sequential in the sense that the chosen action affects the environment state: the next state is drawn from a stochastic transition probability function, and the cost suffered by the learner is determined by an unknown cost function. After a number of episodes, one can measure the performance of the learner's policy with regret, i.e., how much larger the total cost suffered by the learner is compared to the total cost of a fixed optimal policy in hindsight. MDPs are useful for decision-making in various fields, such as robotics \citep{akkaya2019solving}, finance \citep{wei2019model, buehler2019deep}, and healthcare \citep{tsoukalas2015data}. However, in many real-world applications, the tasks and environment may change over time, leading to non-stationary dynamics. In such cases, the assumptions of MDP may not hold, and the performance of the decision-making system may deteriorate.

%AMDP and the motivation for optimistic regret bounds
In this paper, we consider the problem of learning policies in Adversarial MDP (AMDP) as a generalization of the traditional MDP model, where the environment can choose different cost functions for each episode. AMDP gives greater flexibility to account for changing environments and even the existence of other agents. For example, AMDP can model an energy-efficient drone navigation problem \citep{hong2021energy}, where wind incurs higher energy consumption while it is not observed in advance and changes arbitrarily. Stochastic inventory control \citep{even2009online} can also be modeled as AMDP, because item price and inventory cost change from time to time due to economic conditions. Moreover, AMDP can be extended to hierarchical or multi-agent problems, because a parent policy or other agents evolve and incur different costs to a learner. Existing online learning \citep{even2009online, yu2009markov, zimin2013online, neu2010online, neu2010online1, neu2014online, jin2020learning} and policy optimization approaches \citep{shani2020optimistic, luo2021policy} to AMDP solve the optimization problem to minimize the cost in hindsight. However, this worst-case treatment can be too restrictive and result in conservative regret bounds. For instance, in multiplayer games, the actions, and in turn the policies, of other players may be predicted from simulation and historical observation; this insight, if leveraged properly, may turn the game to a specific player's advantage \citep{vundurthy2023intelligent}.

%our problem/approach
Motivated by this shortcoming, we propose to study a new formulation for RL with time-varying cost functions where the aim is to learn a policy that minimizes its regret while resorting to a given set of time-varying predictive estimators of the cost functions, denoted by $\{c_t\}_{t=1}^T$ and $\{M_t\}_{t=1}^T$, respectively. We propose a novel policy search scheme that utilizes the set of optimistic cost predictors and achieves sub-linear regret bounds. Specifically, we make the following contributions:
\begin{itemize}
\item  We show a worst-case regret bound of $\tilde{\O}\left(\sqrt{d\left(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T\right)}\right)$ for the full-information feedback setting\footnote{Recall that the notation $\tilde{\O}(\, \cdot \,)$ hides the logarithmic terms in its argument.} and $\tilde{\O}\left(d(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T)^{2/3}\right)$ in expectation for the bandit feedback setting, where $d(\cdot,\cdot)$ captures cumulative estimation error of the cost predictors. It is also shown that with high probability the algorithm achieves the regret bound of $\tilde{\O}\left(d(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T)^{3/4}\right)$. These regret bounds are optimistic in nature, i.e., the bound scales with the prediction power of optimistic cost predictors, and can lead to constant regret with perfect prediction. In the worst case, on the other hand, the proposed scheme to learn a policy satisfies sublinear regret bounds.
\item Crucial to the establishment of these results is the development of a new cost estimator. This new estimator leverages the bandit information about the cost as well as the set of predictive estimators to update the policy. We show the proposed estimator has variance-reduction benefits and thus it may be of independent interest in similar problems.
\item We also introduce the anytime extensions for continuous training beyond the fixed number of episodes and establish similar regret guarantees. Then we generalize the setting to the unknown transition setting and establish high probability regret bounds by leveraging the idea of transition estimation via confidence sets.
\end{itemize}

\section{Background and Related Work}\label{sec:back}
We start with the precise definition of an AMDP. A standard definition follows an episodic loop-free AMDP \citep{zimin2013online} or a loop-free stochastic shortest path \citep{neu2012adversarial}.
\begin{definition}\label{def:amdp}
	An episodic loop-free Adversarial Markov decision process (AMDP) is a tuple $\mdp = (\st,\act,\tr,L,\{\cost_t\}_{t=1}^T)$ which consists of a finite discrete state space denoted by $\st$, a finite discrete action space denoted by $\act$,  a probabilistic transition function denoted by ${\tr : \st \times \act \times \st \to [0,1]}$, and a sequence of cost functions denoted by $\cost_t : \st \times \act \to \R$ such that:
		\begin{itemize}
		\item The cost functions are bounded, that is, $\cost_t \in [0,1]^{|\st|\times|\act|}$ for $t=1,2,\dots,T$.
    	\item The state space $\st$ is partitioned into $L+1$ non-overlapping layers $\st_0, \st_1, \ldots, \st_L$ such that $\st = \cup_{l=0}^{L} \st_l$ and $\st_{l_1} \cap \st_{l_2} = \emptyset$ for any $l_1 \neq l_2$.
        \item The state transition function $\P(x'|x,a)$ is stationary.
    	\item If for some $x \in \st_l$ and some layer $l \in {\{0,\ldots,L-1\}}$, $\P(x'|x,a) > 0$, then $x' \in \st_{l+1}$; that is, state transition happens only between two consecutive layers.
        \item $\st_0$ and $\st_L$ are singletons; that is, $\st_0=\{x_0\}$ and $\st_L=\{x_L\}$.
	\end{itemize}
\end{definition}



\paragraph{Policy search in AMDP.} Online learning approaches to MDPs, such as Follow-the-Regularized-Leader (FTRL) or Mirror Descent (MD), solve a linear optimization problem over occupancy measures $\rho$. An occupancy measure $\rho(x,a)$ is the joint probability of visiting a state $x$ and taking an action $a$ in that state. Conversely, an occupancy measure determines the behavior of an agent under a stationary, stochastic transition probability distribution, whether known or unknown. The behavior is governed by the policy $\pi$ defined as
\begin{equation}\label{eq:induced_policy}
    \pi_t(a|x) = \frac{\rho_t(x,a)}{\sum_{a'\in\act}\rho_t(x,a')}.
\end{equation}
Therefore, given an MDP, the optimization objective is to minimize the total cost suffered by an occupancy measure. Since an occupancy measure quantifies the probability of a specific state-action pair, the total (expected) cost is a linear function of the cost function $c_t$, i.e., $\langle\rho_t,c_t\rangle = \sum_{x\in\st,a\in\act}c_t(x,a)\rho_t(x,a)$. This leads to the following definition of regret (w.r.t. the policy corresponding to $\rho^*$) that underlies the problem of learning policies in AMDPs,
\begin{equation}
	\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) = \sum_{t=1}^{T} \langle \rho_t-\rho^*,\cost_t \rangle.
\end{equation}
Here, ${\rho^* \in \Delta(\mdp)}$, where $\Delta(\mdp)$ denotes the space of all occupancy measures
over the AMDP $\mdp$, $\langle \cdot,\cdot \rangle$ represents the Euclidean inner product over the space of $\st \times \act$, and $\rho_t$ denotes the agent's selected occupancy measure in episode $t$.

OREPS \citep{zimin2013online} is the baseline algorithm for learning policies in AMDPs that solves the constrained, regularized regret minimization problem via a mirror descent update with stepsize $\eta$, i.e., $\rho_{t+1} = \argmin_{\rho \in \Delta(\mdp)} \eta\langle\rho, c_t\rangle + D_R(\rho\|\rho_t)$, where $R$ is the unnormalized negative entropy
\begin{equation*}
    R(\rho) = \sum_{x \in \st,a \in \act} \rho(x,a) \log \rho(x,a) - \sum_{x \in \st,a \in \act} \rho(x,a),
\end{equation*}
and $D_R$, the Bregman divergence induced by $R$, is the unnormalized KL divergence \citep{abernethy2009beating,lattimore2018bandit}
\begin{equation}\label{eq:kl}
\begin{aligned}
    D_R(\rho \| \rho') &= \sum_{x \in \st,a \in \act} \rho(x,a) \log \frac{\rho(x,a)}{\rho'(x,a)} \\
    &\quad- \sum_{x \in \st,a \in \act} \left(\rho(x,a) - \rho'(x,a)\right).
\end{aligned}
\end{equation}
The KL divergence term penalizes moving away from $\rho_t$, and thus limits the loss of information accumulated over previous episodes. OREPS solves this problem in two steps: it first solves the unconstrained version of the problem and then handles the projection onto $\Delta(\mdp)$ through its dual formulation.
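For completeness, the following short calculation confirms that \eqref{eq:kl} is the divergence induced by the regularizer $R$ above. It is a sketch that uses only the generic definition of a Bregman divergence, $D_R(\rho\|\rho') = R(\rho)-R(\rho')-\langle\nabla R(\rho'),\rho-\rho'\rangle$, which is assumed here rather than stated in the text. Since $\partial R(\rho)/\partial\rho(x,a) = \log\rho(x,a)$, and writing $\sum_{x,a}$ for the sum over $x\in\st$, $a\in\act$,
\begin{equation*}
\begin{aligned}
    D_R(\rho\|\rho') &= \sum_{x,a}\big(\rho\log\rho-\rho\big) - \sum_{x,a}\big(\rho'\log\rho'-\rho'\big) \\
    &\quad- \sum_{x,a}(\rho-\rho')\log\rho' \\
    &= \sum_{x,a}\rho\log\frac{\rho}{\rho'} - \sum_{x,a}(\rho-\rho'),
\end{aligned}
\end{equation*}
where the arguments $(x,a)$ of $\rho$ and $\rho'$ are suppressed for brevity.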

\paragraph{Optimistic online learning.} Let $\{M_t\}_{t=1}^T$ be a sequence of time-varying predictive estimators such that $M_t : \st \times \act  \to [0,1]$ for all $t$. For online linear optimization, \cite{rakhlin2013online} show that optimistic mirror descent (OMD) \citep{chiang2012online} equipped with a similar cost predictor sequence can achieve optimistic regret bounds, i.e., $\tilde{\O}(\sqrt{d(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T)})$, where $d(\cdot,\cdot)$ captures cumulative estimation error of the cost predictors. This result shows that with perfect estimation the regret is $\tilde{\O}(1)$, while for futile estimation, i.e., the worst case, the regret is $\tilde{\O}(\sqrt{T})$. In this paper, we aim to establish optimistic regret bounds for a class of policy search methods in AMDPs. In contrast to \cite{rakhlin2013online}, our setting is more general in the sense that it accounts for the dynamic and stateful nature of the interaction between the learner and the environment, which is captured by the notion of state space. Further, although \cite{rakhlin2013online} leverage the method from \cite{abernethy2012interior} to propose a no-regret scheme for the bandit setting in online linear optimization, their algorithm is not applicable in our setting: the bandit feedback model of the present paper is different from that of \cite{rakhlin2013online} and more meaningful in the sense that the learner observes the cost of the chosen action, not the mixture of costs of all feasible actions. Consequently, the proposed method and its analysis differ considerably from \cite{rakhlin2013online}. Further, we leverage a single-projection method adopted from \cite{joulani2017modular} to reduce the computational cost of optimistic policy search compared to OMD, which requires two projection steps.

\paragraph{Bandit cost estimation.} Learning a policy in the bandit case relies on estimating the unknown cost function for each episode. Given the connection of AMDPs to adversarial bandits, \cite{zimin2013online} incorporate the celebrated importance-weighted cost estimator in  OREPS which was originally exhibited in the EXP3 algorithm \citep{cesa2006prediction}. Recently, \cite{jin2020learning,ghasemi2021no-regret} have utilized the implicit exploration estimator from \cite{neu2015explore}, i.e.,
\begin{equation}\label{eq:cost-ix}
    \rvcosth_t'(x,a) = \frac{\cost_t(x,a)}{\rho_t(x,a)+\gamma} \ind\{(x,a) \in \bar{\u}_L(t)\},
\end{equation}
in a similar OREPS-based update, where $\gamma \geq 0$ is the exploration parameter and $\bar{\u}_L(t)$ denotes the history of states and actions up to and including the $L^\ts{th}$ layer of episode $t$. As we discuss later, such estimators fail to result in optimistic regret guarantees that degrade gracefully with $d(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T)$. Thus, we develop a new cost estimator, characterize its properties, and show that it results in optimistic bounds.

\section{Optimistic Learning in AMDPs}\label{sec:alg}
Given that in the bandit setting we need to resort to cost estimation, the estimation error of the estimator is an integral part of the regret of the underlying algorithms. In order to establish optimistic bounds, our regret analysis shows that it is crucial to have an estimator whose error is controlled by $d(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T)$. Let us consider the estimator \eqref{eq:cost-ix}, define $\E_t[\,\cdot\,] = \E[\,\cdot\,|\u(t)]$, and examine $\E_{t-1}\|\rvcosth_t'-M_t\|^2$, which can be thought of as some notion of variance. Note that \eqref{eq:cost-ix} with $\gamma = 0$ may suffer from an unbounded variance.\footnote{This property is known to be the underlying reason that EXP3 cannot satisfy sublinear regret with high probability in adversarial bandits \citep{lattimore2018bandit}.} With $\gamma >0$, an immediate calculation shows that $\E_{t-1}\big[(\rvcosth_t'(x,a)-M_t(x,a))^2\big]$ cannot be written as a function of $|\cost_t(x,a)-M_t(x,a)|$, which, as our regret analysis demonstrates, prevents \eqref{eq:cost-ix} from achieving optimistic expected regret bounds for any $\gamma \ge 0$.
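To make this concrete, the following calculation is a sketch under the assumption (standard when transitions are known, but not spelled out above) that, conditioned on the history, the pair $(x,a)$ is visited in episode $t$ with probability $\rho_t(x,a)$. With the shorthand $\rho=\rho_t(x,a)$, $c=\cost_t(x,a)$, and $M=M_t(x,a)$,
\begin{equation*}
    \E_{t-1}\big[(\rvcosth_t'(x,a)-M)^2\big] = \rho\Big(\frac{c}{\rho+\gamma}-M\Big)^2 + (1-\rho)M^2.
\end{equation*}
Even under perfect prediction, $c=M$, the right-hand side equals $\rho M^2\big(\tfrac{1-\rho-\gamma}{\rho+\gamma}\big)^2 + (1-\rho)M^2$, which is strictly positive whenever $M>0$ and $\rho<1$. Hence this quantity cannot be bounded by a function of $|c-M|$ that vanishes when the prediction is exact.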

We thus propose a new cost estimator that provably results in an optimistic expected regret bound in conjunction with a mirror descent-based update. The proposed estimator defined for all $\gamma\geq 0$ is as follows
\begin{align}\label{eq:cost-proposed}
    &\rvcosth_t(x,a) \\
    &= \frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma} \ind\{(x,a) \in \bar{\u}_L(t)\}+M_t(x,a). \nonumber
\end{align}
Crucially, the proposed estimator $\rvcosth_t(x,a)$ leverages the predictive estimators $M_t(x,a)$. In particular, in contrast to \eqref{eq:cost-ix} the unexplored state and action pairs incur the cost predicted by $M_t(x,a)$ as opposed to incurring zero cost. 
Also, \cite{wei2018more} suggested a cost estimator similar to \eqref{eq:cost-proposed} with $\gamma=0$ for the multi-armed bandit problem. However, our estimator addresses the problem of learning in MDPs, and the exploration parameter $\gamma>0$ is crucial to our high-probability analysis, which relies on \Cref{thm:bernstein} in \Cref{proof:bandit_hp}.

\Cref{thm:cost-prop} studies the statistical properties of the proposed estimator.
\begin{lemma}\label{thm:cost-prop}
    The proposed cost estimator \eqref{eq:cost-proposed} satisfies
    \begin{gather*}
    \E_{t-1}[ \rvcosth_t(x,a)] = \frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma},\\
    \E_{t-1}\big[(\rvcosth_t(x,a)-M_t(x,a))^2\big] \leq \frac{\big(\cost_t(x,a)-M_t(x,a)\big)^2}{\rho_t(x,a)+\gamma}.
    \end{gather*}
\end{lemma}
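Both properties follow from a direct calculation, sketched here under the assumption that, conditioned on the history, the indicator in \eqref{eq:cost-proposed} equals one with probability $\rho_t(x,a)$. With the shorthand $\rho=\rho_t(x,a)$, $c=\cost_t(x,a)$, and $M=M_t(x,a)$,
\begin{equation*}
\begin{aligned}
    \E_{t-1}[\rvcosth_t(x,a)] &= \rho\,\frac{c-M}{\rho+\gamma} + M = \frac{\rho c+\gamma M}{\rho+\gamma},\\
    \E_{t-1}\big[(\rvcosth_t(x,a)-M)^2\big] &= \rho\,\frac{(c-M)^2}{(\rho+\gamma)^2} \leq \frac{(c-M)^2}{\rho+\gamma},
\end{aligned}
\end{equation*}
where the inequality uses $\rho/(\rho+\gamma)\leq 1$.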

\paragraph{Variance reduction property.} This result shows that if $\gamma >0$ the variance is provably bounded. Furthermore, if $M_t(x,a) \leq 2\cost_t(x,a)$ for all $(x,a) \in \st \times \act$ and $t = 1,\dots,T$, an immediate calculation shows $|\cost_t(x,a)-M_t(x,a)|^2 \leq |\cost_t(x,a)|^2$. That is, the proposed estimator enjoys a lower variance compared to \eqref{eq:cost-ix}. Also, if the predictors $\{M_t\}_{t=1}^T$ are optimistic, i.e., $M_t(x,a)\leq \cost_t(x,a)$ for all $t=1,\dots, T$ and $(x,a) \in \st \times \act$, then the proposed cost estimator \eqref{eq:cost-proposed} is an \textit{optimistically biased} estimator given that
\begin{equation*}
    \begin{aligned}
    \E_{t-1}[ \rvcosth_t(x,a)] &= \frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}\leq \cost_t(x,a).
    \end{aligned}
\end{equation*}
Therefore, as long as $M_t(x,a)\leq \cost_t(x,a)$, the proposed estimator is biased in the same optimistic direction as \eqref{eq:cost-ix}, with a bias that is no larger in magnitude, while having a lower variance. Note that the condition $M_t(x,a)\leq \cost_t(x,a)$ is very mild and may be ensured in a variety of non-adversarial settings based on the observed cost signal.
Finally, note that, different from \eqref{eq:cost-ix}, the variance of the proposed estimator is controlled by the estimation power of the cost predictors, a feature we will leverage to achieve optimistic regret bounds.

\begin{algorithm}[t]
\caption{OREPS with Optimistic Predictor and Implicit eXploration (OREPS-OPIX)} \label{alg:oreps-opix}
\begin{algorithmic}[1]
\REQUIRE Learning rate $\eta$, exploration parameter $\gamma$
\STATE Initialize occupancy measure $\rho_1(x,a)$ as a uniform distribution over $x\in\st_l$ and $a\in\act$ for $l=1,2,\dots,L-1$
\STATE Initialize cost predictor as $M_1=0$
\FOR{Episodes $t=1,2,\dots,T$}
\STATE Initialize cost estimator as $\rvcosth_t=M_t$
\FOR{Time steps $l=1,2,\dots,L-1$}
\STATE Observe state $x_l\in\st_l$ from the environment
\STATE Choose action $a_l\sim\pi_t(\cdot|x_l)$, the policy induced by $\rho_t$ via \eqref{eq:induced_policy}
\STATE Observe cost $c_t(x_l,a_l)$
\STATE Save $x_l$, $a_l$ and $c_t(x_l,a_l)$ to $u_t$
\ENDFOR
\FOR{Tuples $x,a,c_t(x,a)$ in $u_t$}
\STATE $\rvcosth_t(x,a) \leftarrow (c_t(x,a)-M_t(x,a))/(\rho_t(x,a)+\gamma)+M_t(x,a)$
\STATE Update $M_{t+1}(x,a)$
\ENDFOR
\STATE Solve $\rho_{t+1}=\argmin_{\rho \in \Delta(\mdp)}\eta\langle\rho, \rvcosth_t+M_{t+1}-M_t\rangle + D_R(\rho\|\rho_t)$.
\ENDFOR
\end{algorithmic}
\end{algorithm}
We then use the proposed cost estimator in a mirror-descent-type update by adopting the result of \cite{joulani2017modular}. In particular, given $\rho_t$, the agent runs an episode exploration subroutine and subsequently employs
\begin{equation}\label{eq:OREPS-OPIX}
    \rho_{t+1}=\argmin_{\rho \in \Delta(\mdp)}\eta\langle\rho, \rvcosth_t+M_{t+1}-M_t\rangle + D_R(\rho\|\rho_t).
\end{equation}
Please see \Cref{alg:oreps-opix} for a detailed description of the learning process.
We call the resulting scheme OREPS-OPIX. Analogous to the standard MD and OREPS algorithms, this update can be tackled efficiently through a well-known two-step procedure \citep{abernethy2009beating,lattimore2018bandit,zimin2013online}. Specifically, by adopting the result of \cite{zimin2013online},
\begin{equation}\label{eq:update}
    \rho_{t+1}(x,a) = \frac{\rho_t(x,a)e^{\beta(x,a|\hat{v}_t,\rvcosth_t)}}{\sum_{x'\in\st_l,a'\in\act}\rho_t(x',a')e^{\beta(x',a'|\hat{v}_t,\rvcosth_t)}},
\end{equation}
where $l$ denotes the layer to which state $x$ belongs, $\beta$ is defined as
\begin{align*}
    \beta(x,a|\hat{v}_t,\rvcosth_t) &= -\eta(\rvcosth_t(x,a)+M_{t+1}(x,a)-M_t(x,a)) \\
    &\quad- \sum_{x'\in\st_{l+1}}\hat{v}_t(x')\P(x'|x,a)+\hat{v}_t(x),
\end{align*}
and $\hat{v}_t$ is defined as
\begin{equation*}
    \hat{v}_t = \arg\min_v\sum_{l=0}^L\ln\left\{\sum_{x\in \st_l, a\in \act}\rho_t(x,a)e^{\beta(x,a|v,\rvcosth_t)}\right\}.
\end{equation*}

Note that by setting $M_{t} = M_{t+1} = 0$, one recovers the OREPS algorithm. Further, in the full-information case, one can replace $\rvcosth_t$ with the observed cost vector $c_t$.
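As a sketch of the first step of the two-step procedure, consider the update \eqref{eq:OREPS-OPIX} without the constraint $\rho\in\Delta(\mdp)$, and let $\tilde{\rho}_{t+1}$ (a name introduced here only for illustration) denote its minimizer. Since the gradient of $D_R(\rho\|\rho_t)$ with respect to $\rho(x,a)$ is $\log\big(\rho(x,a)/\rho_t(x,a)\big)$, the first-order optimality condition reads
\begin{equation*}
    \eta\big(\rvcosth_t(x,a)+M_{t+1}(x,a)-M_t(x,a)\big) + \log\frac{\tilde{\rho}_{t+1}(x,a)}{\rho_t(x,a)} = 0,
\end{equation*}
that is,
\begin{equation*}
    \tilde{\rho}_{t+1}(x,a) = \rho_t(x,a)\,e^{-\eta(\rvcosth_t(x,a)+M_{t+1}(x,a)-M_t(x,a))}.
\end{equation*}
The second step projects $\tilde{\rho}_{t+1}$ onto $\Delta(\mdp)$ with respect to $D_R$; the terms involving $\hat{v}_t$ in $\beta$ play the role of the dual variables of this projection.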

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Optimistic Regret Bounds}\label{sec:thm}
In this section, we provide a detailed regret analysis of the proposed OREPS-OPIX scheme in \eqref{eq:OREPS-OPIX} equipped with the proposed cost estimator in \eqref{eq:cost-proposed}. 

\Cref{theorem:full} establishes the regret bound under full information. For compactness, we denote the prediction error in episode $t$ as $\sigma_t=c_t-M_t$.
\begin{theorem}[Full information]
    \label{theorem:full}
    Under full information feedback, there exists a stepsize $\eta$ such that OREPS-OPIX satisfies
    \begin{equation}\label{eq:full}
        \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) = \Tilde{\O}\left(\sqrt{L\sum_{t=1}^T\|\sigma_t\|_\infty^2}\right).
    \end{equation}
\end{theorem}
To understand the benefit of leveraging cost predictors, assume $\sum_{t=1}^T \|\cost_t-M_t\|_\infty^2 = \O(T^\alpha)$ for some $0\leq\alpha\leq 1$ where $\alpha = 0$ and $\alpha = 1$ correspond to perfect estimation and futile estimation, respectively. Then, if $\eta = \O(T^{-\alpha/2})$, we have $ \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) = \tilde{\O}(T^{\alpha/2})$. That is, the regret can be constant while in the worst case, the regret is $\tilde{\O}(\sqrt{T})$.
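The stepsize choice can be understood from the following sketch, which assumes (as is typical for mirror-descent analyses, although the exact constants are not stated here) that the full-information bound takes the form $A/\eta+\eta B$. Here $A$ and $B$ are hypothetical shorthand for an $\tilde{\O}(L)$ term and a constant multiple of $\sum_{t=1}^T\|\sigma_t\|_\infty^2$, respectively. For $B>0$,
\begin{equation*}
    \min_{\eta>0}\Big(\frac{A}{\eta}+\eta B\Big) = 2\sqrt{AB}, \quad \text{attained at } \eta=\sqrt{A/B}.
\end{equation*}
When $B$ grows as $T^\alpha$, the minimizing stepsize scales as $T^{-\alpha/2}$ and the resulting bound scales as $T^{\alpha/2}$ up to logarithmic factors, in agreement with \eqref{eq:full}.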

A downside of \Cref{theorem:full} is the requirement of full information on $\cost_t \in [0,1]^{|\st|\times|\act|}$, which is not a realistic assumption. Therefore, we next establish a bound on the expected regret of OREPS-OPIX under bandit feedback. As we discussed before, establishing optimistic regret bounds in the bandit setting for AMDPs seems to necessitate utilizing an estimator with bounded variance. Following \cite{neu2010online}, one could impose an assumption that ensures $\rho_t(x,a)>\alpha$ and establish regret bounds that scale with $\O(\alpha^{-1})$. Instead, we set $\gamma > 0$ but impose the mild assumption that the cost predictors $\{M_t\}_{t=1}^T$ are \textit{optimistic}, i.e., $M_t(x,a)\leq \cost_t(x,a)$.

\cite{fei2020dynamic} proposed an algorithm that directly estimates a state-action value function instead of a cost function that is used to exponentially update a policy. They further extended the algorithm to alternately update policy and value function twice, mirroring the two-step optimization of OMD. Conceptually, it is analogous to having a predictor as a Q-function that is updated with the previous episode's cost function. In the worst case, their static regret bound, where $P_T=0$, scales as $O(\sqrt{T})$.
\cite{zhao2022arxiv} investigated ensemble algorithms and imposed a lower bound on the occupancy measure for all states and actions. This regularization serves to bound the difference between the losses incurred by any two policies. They also explored optimistic variants by incorporating the two-projection OMD as originally proposed by \cite{rakhlin2013online, chiang2012online}. By leveraging this optimistic algorithm, they achieve static regret bounds of $\Tilde{\O}(L\sqrt{\sum_{t=1}^T\|c_t-M_t\|_\infty^2})$ in expectation, as opposed to \eqref{eq:full}. It is worth noting that both works exclusively explored the full information setting. In the subsequent discussion, we analyze the bandit feedback setting.

\begin{theorem}[Bandit -- Expected]
    \label{theorem:bandit_exp}
     Under bandit feedback, there exists a stepsize $\eta$ and an exploration parameter $\gamma$ such that OREPS-OPIX utilizing the proposed cost estimator \eqref{eq:cost-proposed} satisfies
    \begin{equation}\label{eq:bandit_exp}
    \begin{aligned}
        &\E[\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T})] \\
        &\qquad= \Tilde{\O}\left(L^{\frac{1}{3}}\left(\sum_{t=1}^T\|\sigma_t\|_2^2 + \|\sigma_t\|_1\right)^{\frac{2}{3}}\right). 
    \end{aligned}
    \end{equation}
\end{theorem}
Note that the regret bound is optimistic as it scales with the estimation power of the cost predictors. Further, leveraging cost predictors is beneficial in the bandit feedback setting. In particular, the result of \Cref{theorem:bandit_exp} demonstrates that if $\sum_{t=1}^T\|\cost_t-M_t\|_1 = \O(T^{\alpha})$ for some $0\leq\alpha\leq 1$, then setting $\eta = \O(T^{-2\alpha/3})$ and $\gamma = \O(T^{-\alpha/3})$, OREPS-OPIX with the proposed cost estimator suffers $\tilde{\O}(T^{2\alpha/3})$ worst-case expected regret. Therefore, in the best case, the expected regret is constant, while in the worst case, the regret is $\tilde{\O}(T^{2/3})$. Note that our theoretical results may be sub-optimal in the worst case, as we cannot achieve $\tilde{\O}(\sqrt{T})$ worst-case expected regret. Further study in this direction is valuable future work.

Also, \cite{wei2021nonstationary} achieved the dynamic regret bound of $\tilde{\O}(\min\{\sqrt{QT},\Delta^{1/3}T^{2/3}\})$, where $Q$ and $\Delta$ denote the number and amount of changes in the cost function respectively. This is comparable to \Cref{theorem:bandit_exp} when $Q$ grows faster than $\tilde{\O}(T^{1/4})$. 
Still, the bound with the change parameter satisfying $\Delta(t)\ge\max_{\pi\in\Pi}|c_t(\pi)-c_{t+1}(\pi)|$ is pessimistic, while our results can still lead to an optimistic bound. To see this, consider a predictor designed with the cost suffered in the last episode, i.e., $M_{t+1}(\pi_t)=c_t(\pi_t)$. Then, the prediction error in the optimistic bound becomes $\sigma_t=|M_t(\bar{\pi})-c_t(\bar{\pi})|$, where $\bar{\pi}$ is a policy that visits every state-action pair once; this is a special case corresponding to this specific choice of the predictor.

Finally, we present our main result, which establishes a high probability sublinear optimistic regret bound for OREPS-OPIX.
\begin{theorem}[Bandit -- High probability]
    \label{theorem:bandit_hp}
    Under bandit feedback, there exists a stepsize $\eta$ and an exploration parameter $\gamma$ such that with probability at least $1-\delta$ OREPS-OPIX utilizing the proposed cost estimator \eqref{eq:cost-proposed} satisfies
    \begin{align} \label{eq:bandit_hp}
        &\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) = \Tilde{\O}\Bigg(\sqrt{\sum_{t=1}^T\|\sigma_t\|_1^2} \\
        &+\left(L\max_t\|\sigma_t\|_\infty\right)^\frac{1}{4} \left(\sum_{t=1}^T\|\sigma_t\|_\infty^2+\|\sigma_t\|_1\right)^\frac{3}{4}\Bigg). \nonumber
    \end{align}
\end{theorem}
We point out that the regret is, again, optimistic as it scales with the estimation power of the cost predictors. Therefore, in the best case, i.e., under perfect estimation, the regret is constant, while in the worst case, the regret is $\tilde{\O}(T^{3/4})$, with high probability. Integral to establishing this result is the development of tailored technical lemmas and a new concentration inequality to ensure each of the individual terms in the regret remains optimistic. Further study of whether the regret can be improved to $\tilde{\O}(\sqrt{T})$ is left as future work.
\cite{lee2020bias} study the AMDP setting and achieve a high probability guarantee with sublinear regret in the order of $\sqrt{T}$ using the log-barrier method instead of implicit exploration. However, their bound $\O\left(\sqrt{\langle\rho^*,\sum_{t=1}^Tc_t\rangle}\right)$ is in terms of the loss of the best policy as opposed to being optimistic, while our bound $\O\left(d(\{c_t\}_{t=1}^T,\{M_t\}_{t=1}^T)^{3/4}\right)$ diminishes with the estimation power of cost predictors.

\textbf{Proof highlights.} Here we highlight the key steps towards establishing our main result stated in \Cref{theorem:bandit_hp}. The regret can be decomposed into
\begin{equation}\label{eq:decom}
\begin{aligned}
     &\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) \\
     &\quad= \sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle + \sum_{t=1}^T\langle\rho_t,c_t-\E_{t-1}[\rvcosth_t]\rangle\\
     &\qquad+\sum_{t=1}^T\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle
     + \sum_{t=1}^T\langle\rho^*,\rvcosth_t-c_t\rangle.
\end{aligned}
\end{equation}
The first term in \eqref{eq:decom} can be thought of as the regret of the proposed algorithm with full information when the sequence of cost functions is $\{\rvcosth_t\}_{t=1}^{T}$. Hence we can use \Cref{theorem:full} as well as the result of \Cref{thm:cost-prop} to upper bound it with probability one according to
\begin{equation*}
    \sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma^2}\sum_{t=1}^T\|\sigma_t\|_\infty^2.
\end{equation*}
We then show that the second term can be bounded with probability one using the definition of the proposed estimator \eqref{eq:cost-proposed} and the result of \Cref{thm:cost-prop} with
\begin{equation*}
    \sum_{t=1}^T\langle\rho_t,c_t-\E_{t-1}[\rvcosth_t]\rangle \leq \sum_{t=1}^T\gamma\|\sigma_t\|_1 .
\end{equation*}
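This bound can be checked term by term from the first identity in \Cref{thm:cost-prop}: for each $(x,a)$,
\begin{equation*}
\begin{aligned}
    &\rho_t(x,a)\big(c_t(x,a)-\E_{t-1}[\rvcosth_t(x,a)]\big) \\
    &\quad= \frac{\rho_t(x,a)}{\rho_t(x,a)+\gamma}\,\gamma\big(c_t(x,a)-M_t(x,a)\big) \leq \gamma\,|\sigma_t(x,a)|,
\end{aligned}
\end{equation*}
since $\rho_t(x,a)/(\rho_t(x,a)+\gamma)\leq 1$. Summing over $(x,a)$ gives $\langle\rho_t,c_t-\E_{t-1}[\rvcosth_t]\rangle\leq\gamma\|\sigma_t\|_1$.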
To bound the third term, we show that it is the sum of a martingale difference sequence; hence, by using the Azuma--Hoeffding inequality and a careful computation, we can bound it with probability at least $1-\delta$ by an optimistic term:
\begin{equation*}
   \sum_{t=1}^T\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle \le \sqrt{2\log{\frac{1}{\delta}}\sum_{t=1}^T\|\sigma_t\|_1^2}.
\end{equation*}
Notably, this term is independent of $\eta$ and $\gamma$ and in the worst case scales as $\O(\sqrt{T})$.

The last term in \eqref{eq:decom} requires the development of a new Bernstein-type inequality (see Lemma 2 in the supplementary material) to ensure that it can be bounded by an optimistic term. Using this new result, we show that with probability at least $1-\delta$
\begin{equation*}
    \sum_{t=1}^T\langle\rho^*,\rvcosth_t-c_t\rangle \le \frac{L}{\gamma}\log{\frac{L}{\delta}}\max_{t=1,\dots,T}\|\sigma_t\|_\infty.
\end{equation*}
Finally, optimizing for $\eta$ and setting $\gamma = \eta^{1/3}$ furnishes the proof of \Cref{theorem:bandit_hp}.
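
As a rough sketch of this last step (constants are suppressed, and the shorthand $D$ and $\Psi$ is introduced here only for illustration), let $D$ collect the coefficients of the $\eta^{-1}$ and $\gamma^{-1}$ terms and let $\Psi$ collect the coefficients of $\eta/\gamma^2$ and $\gamma$. With $\gamma=\eta^{1/3}$ we have $\eta/\gamma^2=\gamma=\eta^{1/3}$ and, assuming $\eta\le1$, also $\gamma^{-1}\le\eta^{-1}$. The part of the bound that depends on $\eta$ is then of the form
\begin{equation*}
    \frac{D}{\eta}+\eta^{1/3}\Psi = 2D^{1/4}\Psi^{3/4} \quad\text{for}\quad \eta=\left(\frac{D}{\Psi}\right)^{3/4},
\end{equation*}
while the Azuma--Hoeffding term does not depend on $\eta$ or $\gamma$.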

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Extension}\label{sec:ext}
\subsection{Anytime Optimistic Regret Bounds}
In this section, we discuss the extension of OREPS-OPIX to the anytime setting. To obtain the regret bounds in \Cref{sec:thm}, the stepsize and exploration parameters must be set using typically unknown quantities, e.g., the horizon $T$. We alleviate this issue with the doubling trick \citep{besson2018doubling}. Compared to typical applications of the doubling trick, our setting requires additional effort. Usually, learning is divided into phases that double in length, and the stepsize is halved accordingly to compensate for the growing phase lengths; that is, the condition that decides when a particular phase ends is apparent. In our setting, similar to \cite{rakhlin2013online}, this condition is more involved, as we outline next. Moreover, compared to \cite{rakhlin2013online}, the more complicated setting of our problem and the intricate nature of the regret bounds mean that carrying out the doubling trick requires further innovations, especially for the high-probability results.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{algorithm}[t]
\caption{Anytime OREPS-OPIX with Doubling Trick} \label{alg:doubling}
\begin{algorithmic}[1]
\REQUIRE Initial learning rate $\eta_0$, $\kappa =2$ (expected regret) or $\kappa =3$ (high probability regret)
\STATE Initialize phase number $i=1$, starting episode number $s_1=1$, learning rate $\eta_1=\eta_0/2$ and optimistic parameter $\gamma_1={\eta_1}^{1/\kappa}$
\FOR{Episodes $t=1,2,\dots$}
\STATE Interact with the environment and suffer the cost to compute $\Psi_{s_i:t}$
\IF{${\eta_i}^{-1}D_0<{\eta_i}^{1/\kappa}\Psi_{s_i:t}$}
\STATE $i \leftarrow i+1$
\STATE $s_i \leftarrow t$
\STATE $\eta_i \leftarrow 2^{-i}\eta_0$
\STATE $\gamma_i \leftarrow {\eta_i}^{1/\kappa}$
\ENDIF
\STATE Run the rest of \Cref{alg:oreps-opix} to compute $\hat{\cost}_t$, $M_{t+1}$ and $\rho_{t+1}$ using $\eta_i$ and $\gamma_i$
\ENDFOR
\end{algorithmic}
\end{algorithm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As discussed, similar to the standard doubling trick \citep{besson2018doubling,lattimore2018bandit,rakhlin2013online}, the learning rate $\eta_i$ is halved after every phase $i$, instead of using a fixed $\eta$ that depends on $T$. However, the length of each phase does not necessarily double.

Let us first consider the setting of Theorem \ref{theorem:bandit_exp}. 
Let $D_0=L\log{\frac{|\st||\act|}{L}}$ and let $\bar{c}_t(x,a)$ be an unbiased cost estimator, i.e., \Cref{eq:cost-proposed} with $\gamma=0$. Define $\Psi_{\tau:\tau'}=\sum_{t=\tau}^{\tau'}\left\{\|\bar{c}_t-M_t\|_2^2/2+\|\bar{c}_t-M_t\|_1\right\}$. Note that  $\E[\Psi_{1:T}]=\sum_{t=1}^T\frac{1}{2}\|c_t-M_t\|_2^2 + \|c_t-M_t\|_1$.

The reason for defining $\Psi$ in this way is to use it (in addition to $D_0$) to determine when to terminate each phase (see step 4 in \Cref{alg:doubling}). Therefore, $\Psi$ must contain only information that is available to the learner. Since the optimistic regret bounds naturally depend on $c_t$, which is unknown in the bandit setting, directly utilizing the optimistic regret bound from Theorem \ref{theorem:bandit_exp} is not feasible. This subtlety, together with the different feedback model of our setting, results in \textit{significantly} different anytime algorithms and analyses compared to \cite{rakhlin2013online}.
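
To see how this test behaves, note that (a direct rearrangement, shown here only as an illustration) since $\eta_i=2^{-i}\eta_0>0$, the condition in step 4 of \Cref{alg:doubling} is equivalent to
\begin{equation*}
    \Psi_{s_i:t} > D_0\,\eta_i^{-(1+1/\kappa)} = \frac{D_0\,2^{i(1+1/\kappa)}}{\eta_0^{1+1/\kappa}}.
\end{equation*}
Hence the threshold on the accumulated prediction error grows by a factor of $2^{1+1/\kappa}$ from one phase to the next, whereas the number of episodes needed to reach it depends on the observed errors.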

The above discussion leads to an anytime extension of OREPS-OPIX which is summarized in \Cref{alg:doubling}. This method satisfies the following expected regret bound under bandit feedback, which is comparable to \Cref{theorem:bandit_exp}.

\begin{theorem}[Anytime -- Bandit -- Expected]
    \label{theorem:doubling_exp}
    Under bandit feedback, there exists an initial stepsize $\eta_0$ such that \Cref{alg:doubling} with the exploration parameter $\gamma_i=\sqrt{\eta_i}$ satisfies
    \begin{equation}\label{eq:doubling_exp}
    \begin{aligned}
       &\E[\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T})] \\
       &\quad= \Tilde{\O}\left(L^{1/3}\left(\sum_{t=1}^T\frac{\|\sigma_t\|_2^2}{2}+\|\sigma_t\|_1\right)^{2/3}\right) .
    \end{aligned}
    \end{equation}
\end{theorem}

In the full information setting, a similar doubling trick can be applied by comparing $\eta_i^{-1}D_0$ and $\eta_i\Psi_{s_i:t}$, where $\Psi_{\tau:\tau'}=\sum_{t=\tau}^{\tau'}\|c_t-M_t\|_\infty^2/2$. Since here $c_t$ is observed by the learner, we can directly leverage the bound from \Cref{theorem:full}.
\begin{theorem}[Anytime -- Full information]
    \label{theorem:doubling_full}
    Under full information feedback, there exists an initial stepsize $\eta_0$ such that \Cref{alg:doubling} satisfies
    
    \begin{equation}
        \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) = \Tilde{\O}\left(\sqrt{L\sum_{t=1}^T\|\sigma_t\|_\infty^2}\right).
    \end{equation}
\end{theorem}

\subsection{Handling Unknown Transition}
In this section, we extend our prior results to the unknown transition setting, so that the algorithm can be used when the dynamics of the MDP are not revealed to the learner. To model the unknown transition, we construct a confidence set of transition functions using the counting method explored by \cite{jaksch2010near,azar2017minimax,rosenberg2019online,jin2020learning}. Specifically, we adopt the tighter confidence set from \citet[Equation 5]{jin2020learning}:
\begin{equation}\label{eq:confidence-set}
\begin{aligned}
    \mathcal{P} &= \bigg\{\hat{P}: \left|\hat{P}(x'|x,a)-\bar{P}(x'|x,a)\right|\le\epsilon(x'|x,a), \\
    &\forall (x,a,x')\in\st_k\times\act\times\st_{k+1},\ k=0,\dots,L-1\bigg\},
\end{aligned}
\end{equation}
where $\bar{P}$ is the count-based empirical transition probability and the confidence margin $\epsilon(x'|x,a)$ is defined as 
\begin{equation*}
    2\sqrt{\frac{\bar{P}(x'|x,a)\log\left(\frac{T|\st||\act|}{\delta}\right)}{\max\{1,N(x,a)-1\}}} + \frac{14\log\left(\frac{T|\st||\act|}{\delta}\right)}{3\max\{1,N(x,a)-1\}}
\end{equation*}
for $\delta\in(0,1)$ and state-action visit counter $N(x,a)$. We propose the cost estimator
\begin{align}\label{eq:cost-unknown-transition}
    &\rvcosth_t(x,a) \\
    &= \frac{\cost_t(x,a)-M_t(x,a)}{u_t(x,a)+\gamma} \ind\{(x,a) \in \bar{\u}_L(t)\}+M_t(x,a), \nonumber
\end{align}
where $u_t(x,a)=\max_{P\in\mathcal{P}}\rho^{P,\pi_t}(x,a)$ is the upper occupancy bound over $\mathcal{P}$, and $\rho^{P,\pi_t}$ is the occupancy measure under the transition probability $P$ and the policy $\pi_t$ induced from $\rho_t$ as in \eqref{eq:induced_policy}. Again, \eqref{eq:cost-unknown-transition} is an optimistically biased estimator, given that the predictor is optimistic and $u_t(x,a)\ge\rho_t(x,a)$ by definition. Utilizing this new estimator in OREPS-OPIX, we obtain the following result.
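
As a brief check of the optimistic bias (a sketch, assuming that the true transition function lies in $\mathcal{P}$ and that $(x,a)$ is visited in episode $t$ with conditional probability $\rho_t(x,a)$), taking the conditional expectation of \eqref{eq:cost-unknown-transition} and suppressing the argument $(x,a)$ on the right-hand side gives
\begin{align*}
    \E_{t-1}[\rvcosth_t(x,a)]
    &= \frac{\rho_t}{u_t+\gamma}\left(\cost_t-M_t\right)+M_t \\
    &\le \cost_t,
\end{align*}
since $M_t(x,a)\le\cost_t(x,a)$ and $\rho_t(x,a)\le u_t(x,a)+\gamma$.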

\begin{theorem}[Unknown transition -- Bandit -- High probability]
    \label{theorem:unknown_transition_hp}
    Under bandit feedback with unknown transition, there exists a stepsize $\eta$ and an exploration parameter $\gamma$ such that with probability at least $1-7\delta$ OREPS-OPIX utilizing the proposed cost estimator \eqref{eq:cost-unknown-transition} satisfies
    \begin{equation}
    \begin{aligned}
        &\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) \\
        &\quad= \O\Biggl(L^\frac{1}{4}\left(\log\frac{|\st||\act|}{L}+\log\frac{L}{\delta}\max_t\|\sigma_t\|_\infty\right)^\frac{1}{4} \\
        &\qquad\cdot \left(\sum_{t=1}^T\|\sigma_t\|_\infty^2+\|\sigma_t\|_1\right)^\frac{3}{4} + \sqrt{\sum_{t=1}^T\|\sigma_t\|_1^2} \\
        &\qquad+ L|\st|\sqrt{|\act|T\log\frac{T|\st||\act|}{\delta}}\Biggr).
    \end{aligned}
    \end{equation}
\end{theorem}
Notice that in an optimistic case, where the prediction errors are small, the bound is dominated by the term $\O\left(L|\st|\sqrt{|\act|T\log\frac{T|\st||\act|}{\delta}}\right)$. Then \Cref{theorem:unknown_transition_hp} achieves the same bound as \cite{jin2020learning} but with higher probability. This term arises from a judicious application of Bennett's concentration inequality \citep[Corollary 5]{maurer2009empirical} to study how the error of the estimated occupancy measure $\rho^{P,\pi_t}$ with respect to $\rho_t$ of the known transition setting is bounded within the confidence set \eqref{eq:confidence-set}. It is nontrivial, and therefore an interesting direction of research, to see whether an optimistic version of this concentration inequality can be established using, e.g., the techniques that led to our new Bernstein-type inequality (see \Cref{thm:bernstein} in \Cref{proof:bandit_hp}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Numerical Experiments}\label{sec:exp}

\begin{figure*}[!t]
\subfigure[Average regret and variance of OREPS-OPIX, OREPS, and OREPS-IX.]{\includegraphics[width=0.48\textwidth]{OREPS_comparison_regret.pdf} \label{fig:oreps-opix}}
\hfill
\subfigure[Error of cost predictors against the true cost function.]{\includegraphics[width=0.48\textwidth]{UAI2024_error_comparison.pdf} \label{fig:oreps-opix-error}}
\caption{Results of the numerical experiment for OREPS, OREPS-IX, and OREPS-OPIX with different predictors, plotted versus the number of episodes. \Cref{fig:oreps-opix} shows the regret reduction benefit as well as the variance reduction property of the proposed cost estimator \eqref{eq:cost-proposed}. \Cref{fig:oreps-opix-error} shows that the cost predictors comply with the optimistic prediction assumption.}
\label{fig:experiment_oreps}
\end{figure*}

In this section, we perform a simple experiment to demonstrate the benefit of implicit exploration and cost predictors.\footnote{The code for this experiment is accessible at this link: https://github.itap.purdue.edu/moon182/OREPS-OPIX.git}
We consider a drone navigation task modeled by a 2D grid, where the agent moves one cell at a time and aims to reach the goal with minimal cost. If the drone enters a cell with turbulence or a wind gust, it incurs a higher cost due to higher fuel consumption and possible damage to the aircraft. The AMDP of the environment is described below:
\begin{itemize}
    \item State space: $\st = \{(l, A_x, A_y, G_x, G_y)\}$, \\
    where $l\in \{1,\dots,L\}$ is time step, $(A_x,A_y)$ is agent location and $(G_x,G_y)$ is goal location.
    \item Action space: $\act=\{$left, right, up, down$\}$
    \item Cost function: \\
    $c_t(x,a) = \begin{cases}
        0, & \quad\text{if reaching the goal} \\
        1, & \quad\text{if encountering turbulence} \\
        \epsilon, & \quad\text{otherwise},
    \end{cases}$\\
    where $0<\epsilon<1$ is a small positive constant. The cost $c_t$ changes every $t_w$ episodes, when each turbulence location randomly moves to one of its neighboring cells. The turbulence is not observable to the agent but results in a higher cost.
    \item Bandit feedback: agent observes $c_t(x,a)$ only for its trajectory $(x,a)\in\u(t)$ in episode $t$.
    \item State transition is deterministic: \\
    $\P(s'|s,a)=\begin{cases}
        1, & \quad\text{if taking $a$ in $s$ results in $s'$} \\
        0, & \quad\text{otherwise}.
    \end{cases}$
    \item Wind incurs cost but does not affect state transitions.
    \item Timeout $L$ is the maximum number of time steps in an episode.
    \item When the agent reaches the goal, it remains in the terminal state $s_\text{terminal}^l$ until the end of the episode regardless of its action, that is, $\tr(s_\text{terminal}^{l+1}|s_\text{terminal}^l,a)=1$ for all $a\in\act$, and $\st_L=\{s_\text{terminal}^L\}$ is a singleton.
\end{itemize}
The details of the experiment setting are provided in the Appendix.

Figure \ref{fig:oreps-opix} depicts the performance (in terms of cumulative average regret) of OREPS-OPIX compared with vanilla OREPS and OREPS with implicit exploration (OREPS-IX). OREPS-OPIX with the perfect predictor assumes access to a perfect predictor with full information ($M_t=c_t$, $M_{t+1}=c_{t+1}$). The more realistic latest predictor predicts the cost of each state-action pair as the cost that the learner suffered on its last visit to that pair. It makes the mild assumption that the period $t_w$ is known, and it resets its value to zero every $t_w$ episodes to ensure optimistic prediction.

There are two notable points in this result. First, the regret of OREPS without implicit exploration (in blue) explodes as learning progresses. This happens when the occupancy measure of some state-action pairs approaches zero: $\rho_t(x,a)\rightarrow0$. Then the unbiased cost estimator, i.e., \eqref{eq:cost-ix} with $\gamma = 0$, which divides the cost signal by the occupancy measure, grows arbitrarily large, and $\rho_t(x,a)$ becomes exactly 0 due to finite floating-point precision. It remains 0 from that point on, because the occupancy measure is updated multiplicatively according to \eqref{eq:update}. This phenomenon is consistent with the fact that the naive importance-weighted cost estimator in OREPS, which is based on EXP3, suffers from high variance. Second, OREPS-OPIX (in green and red) improves both convergence and variance over OREPS-IX (in orange), which is consistent with the result of \Cref{thm:cost-prop} on the reduced variance of the proposed cost estimator \eqref{eq:cost-proposed} while retaining the same bias.
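
This contrast can be read off from the second-moment computation in the proof of \Cref{thm:cost-prop}. As a short illustration (writing $\sigma_t=\cost_t-M_t$, which we assume here as shorthand, and suppressing the argument $(x,a)$), for $\rho_t>0$ the estimator \eqref{eq:cost-proposed} satisfies
\begin{equation*}
    \E_{t-1}\left[(\rvcosth_t-M_t)^2\right] = \frac{\rho_t\,\sigma_t^2}{(\rho_t+\gamma)^2}.
\end{equation*}
With $\gamma=0$ this equals $\sigma_t^2/\rho_t$, which is unbounded as $\rho_t\rightarrow0$, whereas for $\gamma>0$ it is at most $\sigma_t^2/\gamma$, and a smaller prediction error $|\sigma_t|$ shrinks it further.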

\Cref{fig:oreps-opix-error} demonstrates the error of optimistic cost predictors with respect to the true cost. By observing the positive values of error, we confirm that the formulation of cost predictors does not violate the optimistic prediction assumption. Every $t_w=1000$ episodes, the error of the latest predictor spikes, because it periodically resets its value to zero.

\begin{figure*}[!t]
\subfigure[Average regret of OREPS-OPIX with different predictors.]{\includegraphics[width=0.47\textwidth]{UAI2024_predictor_comparison_regret.pdf} \label{fig:predictors-regret}}
\hfill
\subfigure[Error of cost predictors against the true cost function.]{\includegraphics[width=0.48\textwidth]{UAI2024_error_comparison_all.pdf} \label{fig:predictor-error}}
\caption{Results of the numerical experiment for OREPS-OPIX with different predictors, plotted versus the number of episodes. \Cref{fig:predictors-regret} shows that less accurate information about $t_w$ does not cause significant harm to the performance of OREPS-OPIX. \Cref{fig:predictor-error} shows the consequences for the error when cost predictors are constructed based on inaccurate information about the environment.}
\label{fig:experiment_predictor}
\end{figure*}

In \Cref{fig:predictors-regret}, we relax the optimistic prediction assumption by giving the predictor inaccurate information about how frequently the cost function changes. The latest predictors with more reset (in purple) and less reset (in brown) assume shorter and longer periods of change, respectively, than the true value of $t_w$. However, the results show that the performance degradation is not noticeable compared to the latest predictor with accurate information about the period (in green). In fact, the strict optimism of the predictor is introduced for mathematical convenience, and it is sufficient for it to hold in a (weighted) sum: $\sum_{x,a}\omega(x,a)(c_t(x,a)-M_t(x,a))\ge0$ with $\omega(\cdot)=1$ or $\omega(\cdot)>0$. Intuitively, what is more critical is how far the prediction is from the true cost function.

\Cref{fig:predictor-error} shows the error, i.e., $\sum_{x,a}{c_t(x,a)-M_t(x,a)}$, of different predictors. The latest predictors with more reset and less reset are built on incorrect information about the period of cost change, namely $\hat{t}_w=500$ and $\hat{t}_w=2000$, respectively. Although \Cref{fig:predictors-regret} demonstrates minimal loss in the performance of OREPS-OPIX when the predictor design is based on flawed information, \Cref{fig:predictor-error} shows that the predictor error is actually aggravated by the flaws (purple and brown as opposed to green). It even shows that the latest predictor with less reset (brown) violates the optimistic prediction assumption when the cost function changes without a reset, as observed at $t=1000,3000,\dots,9000$. The result hints at the practical success of our algorithm in the presence of minor uncertainties in the predictor design.

Finally, \Cref{fig:oreps-opix-error,fig:predictor-error} also exhibit a tendency for the error to grow over time as the occupancy measure converges. This is the result of slower convergence of $M_t$, which is caused by the reduced entropy of the occupancy measure. \Cref{eq:update} updates the occupancy measure by discounting its value exponentially with respect to the loss (estimate), which forces the value of a state-action pair with relatively high loss (estimate) to approach zero. The exploding regret of OREPS (blue) in \Cref{fig:oreps-opix} is likewise due to the fact that a state-action pair with near-zero occupancy measure cannot be visited again without implicit exploration.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusion}\label{sec:conc}
We studied the problem of establishing optimistic regret bounds for online learning in AMDPs. Our theoretical analysis demonstrated that such bounds in the bandit feedback setting necessitate cost estimators with a bounded variance that scales with the estimation power of cost predictors. To that end, we proposed a new estimator that benefits from variance reduction and proved that this estimator in conjunction with a variant of mirror descent enjoys optimistic regret bounds in both full information and bandit feedback settings. Notably, we showed the proposed method and its anytime extension enjoy high probability sublinear optimistic regrets, a result which crucially relied on the characteristics of the new cost estimator and the development of new technical lemmas to ensure every term in the regret decomposition can be bounded by optimistic terms. Finally, we provided an extension to the unknown transition setting and established similar results.

In the standard MDP setting, the cost function remains constant over time, and direct optimization of the cost function without bounding the relative entropy becomes feasible. In the case of full information feedback, the cost function is fully observed after the initial episode, resulting in zero regret from the second episode onward. In the bandit feedback case, we have a bound with diminishing prediction error $c_t-M_t$, as costs are revealed for additional states and actions. The rate at which the error reduces, and efficient strategies for its reduction, present an interesting direction for future research.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This work was supported in part by NSF CNS 2313109 and the Manufacturing Design Laboratory (MDLab) at Purdue University.
\end{acknowledgements}

% References
\bibliography{ref}

\newpage

\onecolumn

\title{Optimistic Regret Bounds for Online Learning in \\Adversarial Markov Decision Processes \\(Supplementary Material)}
\maketitle

\appendix
\section{Proofs}
\subsection{Proof of Lemma \ref{thm:cost-prop}}
By the definition of $\rvcosth_t(x,a)$ in \eqref{eq:cost-proposed},
\begin{align*}
    \E_{t-1}[\rvcosth_t(x,a)] &= \E_{t-1}\left[\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\ind\{(x,a)\in \bar{\u}_L(t)\}+M_t(x,a)\right] \\
    &= \frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\rho_t(x,a)+M_t(x,a) \\
    &= \frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}.
\end{align*}
By \eqref{eq:cost-proposed}, $\rho_t(\cdot)\ge0$ and $\gamma\ge0$,
\begin{align*}
    \E_{t-1}[(\rvcosth_t(x,a)-M_t(x,a))^2] &= \E_{t-1}\left[\left(\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\ind\{(x,a)\in \bar{\u}_L(t)\}\right)^2\right] \\
    &= \left(\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\right)^2\rho_t(x,a) \\
    &\le \frac{(\cost_t(x,a)-M_t(x,a))^2}{\rho_t(x,a)+\gamma}.
\end{align*}
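Two elementary facts are used above and are spelled out here for completeness (a sketch under the assumption, implicit in the computation above, that $(x,a)$ is visited in episode $t$ with conditional probability $\rho_t(x,a)$). First, the indicator is idempotent, so $\E_{t-1}[\ind\{(x,a)\in \bar{\u}_L(t)\}^2]=\E_{t-1}[\ind\{(x,a)\in \bar{\u}_L(t)\}]=\rho_t(x,a)$. Second, whenever $\rho_t(x,a)+\gamma>0$,
\begin{equation*}
    \frac{\rho_t(x,a)}{(\rho_t(x,a)+\gamma)^2} = \frac{1}{\rho_t(x,a)+\gamma}\cdot\frac{\rho_t(x,a)}{\rho_t(x,a)+\gamma} \le \frac{1}{\rho_t(x,a)+\gamma},
\end{equation*}
which gives the final inequality.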
\qedsymbol

\subsection{Proof of Theorem \ref{theorem:full}}
\label{proof:full}
First, decompose the regret of $\rho_t$ with respect to $\rho^*$ as
\begin{equation}
    \langle c_t,\rho_t-\rho^*\rangle = \langle c_t,\rho_{t+1}-\rho^*\rangle + \langle c_t,\rho_t-\rho_{t+1}\rangle. \label{eq:full_proof1}
\end{equation}
If $\rho_{t+1}$ is the solution of \eqref{eq:OREPS-OPIX} with $\cost_t$ instead of $\rvcosth_t$, then for any $\rho^*\in\Delta(\mdp)$, the first-order optimality condition states that the gradient of the objective function at $\rho_{t+1}$ has a nonpositive inner product with the direction from $\rho^*$ to $\rho_{t+1}$, i.e., $\langle\nabla_\rho\{\eta\langle\rho,\cost_t+M_{t+1}-M_t\rangle + D_R(\rho\|\rho_t)\}_{\rho=\rho_{t+1}}, \rho_{t+1}-\rho^*\rangle \le 0$. Thus,
\begin{equation*}
    \langle\eta(\cost_t+M_{t+1}-M_t) + \nabla R(\rho_{t+1}) - \nabla R(\rho_t), \rho_{t+1}-\rho^*\rangle \le 0.
\end{equation*}
The first term of the decomposition \eqref{eq:full_proof1} is then bounded as
\begin{equation*}
    \langle c_t,\rho_{t+1}-\rho^*\rangle \le \frac{1}{\eta}\langle\nabla R(\rho_t)-\nabla R(\rho_{t+1}),\rho_{t+1}-\rho^*\rangle + \langle M_t-M_{t+1}, \rho_{t+1}-\rho^*\rangle.
\end{equation*}
By the definition of Bregman divergence: $D_R(\rho\|\rho')=R(\rho)-R(\rho')-\langle\nabla R(\rho'),\rho-\rho'\rangle$,
\begin{align*}
    \langle c_t,\rho_{t+1}-\rho^*\rangle &\le \frac{1}{\eta}\{D_R(\rho^*\|\rho_t)-D_R(\rho^*\|\rho_{t+1})-D_R(\rho_{t+1}\|\rho_t)\} + \langle M_t-M_{t+1}, \rho_{t+1}-\rho^*\rangle \\
    &= \frac{1}{\eta}\{D_R(\rho^*\|\rho_t)-D_R(\rho^*\|\rho_{t+1})-D_R(\rho_{t+1}\|\rho_t)\} + \langle M_{t+1}-M_t, \rho^*-\rho_{t+1}\rangle \\
    &= \frac{1}{\eta}\{D_R(\rho^*\|\rho_t)-D_R(\rho^*\|\rho_{t+1})-D_R(\rho_{t+1}\|\rho_t)\} \nonumber\\
    &\qquad+ \langle M_{t+1}-M_t, \rho^*-\rho_t\rangle + \langle M_{t+1}-M_t, \rho_t-\rho_{t+1}\rangle.
\end{align*}
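The first line above relies on the three-point identity for Bregman divergences, which can be checked by expanding the definition (the function values $R(\rho^*)$, $R(\rho_t)$ and $R(\rho_{t+1})$ cancel):
\begin{align*}
    &D_R(\rho^*\|\rho_t)-D_R(\rho^*\|\rho_{t+1})-D_R(\rho_{t+1}\|\rho_t) \\
    &\quad= -\langle\nabla R(\rho_t),\rho^*-\rho_t\rangle + \langle\nabla R(\rho_{t+1}),\rho^*-\rho_{t+1}\rangle + \langle\nabla R(\rho_t),\rho_{t+1}-\rho_t\rangle \\
    &\quad= \langle\nabla R(\rho_t)-\nabla R(\rho_{t+1}),\rho_{t+1}-\rho^*\rangle.
\end{align*}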
Plugging the result back to \eqref{eq:full_proof1},
\begin{align}
    \langle c_t,\rho_t-\rho^*\rangle &\le  \frac{1}{\eta}\{D_R(\rho^*\|\rho_t)-D_R(\rho^*\|\rho_{t+1})-D_R(\rho_{t+1}\|\rho_t)\} \nonumber\\
    &\qquad+ \langle M_{t+1}-M_t, \rho^*-\rho_t\rangle + \langle M_{t+1}, \rho_t-\rho_{t+1}\rangle + \langle c_t-M_t, \rho_t-\rho_{t+1}\rangle \nonumber\\
    &\begin{aligned}\label{eq:full_proof2}
    &=\frac{1}{\eta}\{D(\rho^*\|\rho_t)-D(\rho^*\|\rho_{t+1})-D(\rho_{t+1}\|\rho_t)\} \\
    &\qquad- \langle M_t, \rho^*-\rho_t\rangle + \langle M_{t+1}, \rho^*-\rho_{t+1}\rangle + \langle c_t-M_t, \rho_t-\rho_{t+1}\rangle.
    \end{aligned}
\end{align}
By H\"older's and Young's inequalities,
\begin{align*}
    \langle c_t-M_t, \rho_t-\rho_{t+1}\rangle &\le \frac{\eta}{2}\|c_t-M_t\|_\infty^2 + \frac{1}{2\eta}\|\rho_t-\rho_{t+1}\|_1^2.
\end{align*}
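In more detail, H\"older's inequality pairs the $L_\infty$ and $L_1$ norms, and Young's inequality in the form $ab\le\frac{\eta}{2}a^2+\frac{1}{2\eta}b^2$, valid for any $a,b\ge0$ and $\eta>0$, splits the resulting product:
\begin{equation*}
    \langle c_t-M_t, \rho_t-\rho_{t+1}\rangle \le \|c_t-M_t\|_\infty\|\rho_t-\rho_{t+1}\|_1 \le \frac{\eta}{2}\|c_t-M_t\|_\infty^2 + \frac{1}{2\eta}\|\rho_t-\rho_{t+1}\|_1^2.
\end{equation*}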
Since negative entropy is 1-strongly convex with respect to $L_1$ norm,
\begin{equation*}
    \frac{1}{2}\|\rho_t-\rho_{t+1}\|_1^2 \le R(\rho_{t+1})-R(\rho_t)-\langle\nabla R(\rho_t),\rho_{t+1}-\rho_t\rangle = D_R(\rho_{t+1}\|\rho_t).
\end{equation*}
Plugging the result back to \eqref{eq:full_proof2},
\begin{align*}
    \langle c_t,\rho_t-\rho^*\rangle &\le \frac{1}{\eta}\{D(\rho^*\|\rho_t)-D(\rho^*\|\rho_{t+1})-D(\rho_{t+1}\|\rho_t)\} \\
    &\qquad- \langle M_t, \rho^*-\rho_t\rangle + \langle M_{t+1}, \rho^*-\rho_{t+1}\rangle + \frac{\eta}{2}\|c_t-M_t\|_\infty^2 + \frac{1}{\eta}D(\rho_{t+1}\|\rho_t) \\
    &= \frac{1}{\eta}\{D(\rho^*\|\rho_t)-D(\rho^*\|\rho_{t+1})\} - \langle M_t, \rho^*-\rho_t\rangle + \langle M_{t+1}, \rho^*-\rho_{t+1}\rangle \\
    &\qquad+ \frac{\eta}{2}\|c_t-M_t\|_\infty^2.
\end{align*}
By summing over $T$ episodes,
\begin{align*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) &= \sum_{t=1}^T\langle c_t,\rho_t-\rho^*\rangle \\
    &\le \frac{1}{\eta}\{D(\rho^*\|\rho_1)-D(\rho^*\|\rho_{T+1})\} - \langle M_1, \rho^*-\rho_1\rangle + \langle M_{T+1}, \rho^*-\rho_{T+1}\rangle \\
    &\qquad+ \sum_{t=1}^T\frac{\eta}{2}\|c_t-M_t\|_\infty^2.
\end{align*}
Without loss of generality, we can set $M_1=M_{T+1}=0$. Then, by the non-negativity and the definition of the Bregman divergence,
\begin{align*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) &\le \frac{1}{\eta}D(\rho^*\|\rho_1)  + \sum_{t=1}^T\frac{\eta}{2}\|c_t-M_t\|_\infty^2 \\
    &= \frac{1}{\eta}\{R(\rho^*)-R(\rho_1)-\langle\nabla R(\rho_1), \rho^*-\rho_1\rangle\} + \sum_{t=1}^T\frac{\eta}{2}\|c_t-M_t\|_\infty^2.
\end{align*}
Since the negative entropy satisfies $R(\cdot)\le0$ and $\rho_1$ is initialized as the uniform distribution within each layer, for which $\langle\nabla R(\rho_1),\rho^*-\rho_1\rangle=0$,
\begin{align*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) &\le -\frac{1}{\eta}R(\rho_1) + \sum_{t=1}^T\frac{\eta}{2}\|\cost_t-M_t\|_\infty^2 \\
    &= \frac{1}{\eta}\sum_{k=0}^{L-1}\sum_{x\in\st_k}\sum_a\bigl(-\rho_1(x,a)\log\rho_1(x,a)\bigr) + \sum_{t=1}^T\frac{\eta}{2}\|\cost_t-M_t\|_\infty^2 \\
    &= \frac{1}{\eta}\sum_{k=0}^{L-1}\sum_{x\in\st_k}\sum_a\frac{1}{|\st_k||\act|}\log|\st_k||\act| + \sum_{t=1}^T\frac{\eta}{2}\|\cost_t-M_t\|_\infty^2 \\
    &= \frac{1}{\eta}\sum_{k=0}^{L-1}\log|\st_k||\act| + \sum_{t=1}^T\frac{\eta}{2}\|\cost_t-M_t\|_\infty^2 \\
    &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \sum_{t=1}^T\frac{\eta}{2}\|\cost_t-M_t\|_\infty^2,
\end{align*}
where the last step uses the concavity of the logarithm (Jensen's inequality) together with $\sum_{k=0}^{L-1}|\st_k|\le|\st|$.

If $\eta=\sqrt{\frac{2L}{\sum\|c_t-M_t\|_\infty^2}\log\frac{|\st||\act|}{L}}$,
\begin{equation*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) \le \sqrt{2L\log\frac{|\st||\act|}{L}\sum_{t=1}^T\|c_t-M_t\|_\infty^2}.
\end{equation*}
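This choice of $\eta$ can be verified directly. Writing $D=L\log\frac{|\st||\act|}{L}$ and $S=\sum_{t=1}^T\|c_t-M_t\|_\infty^2$ (shorthand introduced only for this check, with $S>0$ assumed), the bound has the form $D/\eta+\eta S/2$, which is minimized over $\eta>0$ at $\eta=\sqrt{2D/S}$, where
\begin{equation*}
    \frac{D}{\eta} = \frac{\eta S}{2} = \sqrt{\frac{DS}{2}}, \qquad \frac{D}{\eta}+\frac{\eta S}{2} = \sqrt{2DS}.
\end{equation*}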
\qedsymbol

\subsection{Proof of Theorem \ref{theorem:bandit_exp}}
The expected total regret with respect to $\rho^*\in\Delta(\mdp)$ can be decomposed into
\begin{equation}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*, \cost_t\rangle\right] = \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*, \rvcosth_t\rangle\right] + \E\left[\sum_{t=1}^T\langle\rho_t, \cost_t-\rvcosth_t\rangle\right] - \E\left[\sum_{t=1}^T\langle\rho^*, \cost_t-\rvcosth_t\rangle\right]. \label{eq:bandit_decomp}
\end{equation}

For the first term in \eqref{eq:bandit_decomp}, we follow the same argument as in \Cref{proof:full} with $\cost_t\leftarrow\rvcosth_t$. Let $\rho_{t+1}$ be the solution of \eqref{eq:OREPS-OPIX} and decompose the term as
\begin{equation*}
    \langle\rho_t-\rho^*,\rvcosth_t\rangle = \langle\rho_{t+1}-\rho^*,\rvcosth_t\rangle + \langle\rho_t-\rho_{t+1},\rvcosth_t\rangle.
\end{equation*}
For all $\rho^*\in\Delta(\mdp)$, the first-order optimality condition for \eqref{eq:OREPS-OPIX} states that the gradient of the objective at $\rho_{t+1}$ has a nonpositive inner product with the direction from $\rho^*$ to $\rho_{t+1}$, i.e., $\langle\nabla_\rho\{\eta\langle\rho,\rvcosth_t+M_{t+1}-M_t\rangle+D(\rho\|\rho_t)\}_{\rho=\rho_{t+1}}, \rho_{t+1}-\rho^*\rangle\le0$. Thus, following \Cref{proof:full},
\begin{align*}
    \langle\rho_t-\rho^*,\rvcosth_t\rangle &\le \frac{1}{\eta}\langle\rho_{t+1}-\rho^*,\nabla R(\rho_t)-\nabla R(\rho_{t+1})\rangle + \langle\rho_{t+1}-\rho^*,M_t-M_{t+1}\rangle + \langle\rho_t-\rho_{t+1},\rvcosth_t\rangle \\
    &\le \frac{1}{\eta}\{D(\rho^*\|\rho_t)-D(\rho^*\|\rho_{t+1})\} - \langle M_t, \rho^*-\rho_t\rangle + \langle M_{t+1}, \rho^*-\rho_{t+1}\rangle + \frac{\eta}{2}\|\rvcosth_t-M_t\|_\infty^2.
\end{align*}
Summing over $T$ episodes and following \Cref{proof:full} again,
\begin{equation*}
    \sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \sum_{t=1}^T\frac{\eta}{2}\|\rvcosth_t-M_t\|_\infty^2.
\end{equation*}
Taking expectation over the randomness associated with $\u(T)$,
\begin{align*}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle\right] &\le \E\left[\frac{L}{\eta}\log\frac{|\st||\act|}{L} + \sum_{t=1}^T\frac{\eta}{2}\|\rvcosth_t-M_t\|_\infty^2\right] \\
    &= \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2}\sum_{t=1}^T\E\left[\|\rvcosth_t-M_t\|_\infty^2\right].
\end{align*}
By the tower property of conditional expectation,
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle\right] \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2}\sum_{t=1}^T\E\left[\E_{t-1}[\|\rvcosth_t-M_t\|_\infty^2]\right].
\end{equation*}
By $\|\cdot\|_\infty^2\le\|\cdot\|_2^2$ and \Cref{thm:cost-prop},
\begin{align*}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle\right] &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2}\sum_{t=1}^T\E\left[\E_{t-1}[\|\hat{c}_t-M_t\|_2^2]\right] \\
    &= \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2}\sum_{t=1}^T\E\left[\sum_{x,a}\E_{t-1}[(\rvcosth_t(x,a)-M_t(x,a))^2]\right] \\
    &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2}\sum_{t=1}^T\E\left[\sum_{x,a}\frac{(\cost_t(x,a)-M_t(x,a))^2}{\rho_t(x,a)+\gamma}\right].
\end{align*}
Since $\rho_t(\cdot)\ge0$ and $\gamma>0$,
\begin{equation}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle\right] \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma}\sum_{t=1}^T\|\cost_t-M_t\|_2^2.
    \label{eq:bandit_decomp1}
\end{equation}

The second term in \eqref{eq:bandit_decomp} can be decomposed into
\begin{equation*}
    \langle\rho_t,\cost_t-\rvcosth_t\rangle = \langle\rho_t,\cost_t-\E_{t-1}[\rvcosth_t]\rangle + \langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle.
\end{equation*}
By \Cref{thm:cost-prop},
\begin{align*}
    \langle\rho_t,\cost_t-\E_{t-1}[\rvcosth_t]\rangle &= \sum_{x,a}\rho_t(x,a)\left\{\cost_t(x,a)-\frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}\right\} \\
    &= \sum_{x,a}\rho_t(x,a)\frac{\gamma\cost_t(x,a)-\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}.
\end{align*}
Since $\rho_t(\cdot)\ge0$, $\gamma>0$ and $M_t(x,a)\le \cost_t(x,a)$ for all $x,a$,
\begin{equation*}
    \langle\rho_t,\cost_t-\E_{t-1}[\rvcosth_t]\rangle \le \sum_{x,a}\gamma|\cost_t(x,a)-M_t(x,a)| = \gamma\|\cost_t-M_t\|_1.
\end{equation*}
Adding over $T$ episodes and taking expectation with respect to the randomness associated with $\u(T)$,
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho_t,\cost_t-\E_{t-1}[\rvcosth_t]\rangle\right] \le \sum_{t=1}^T\gamma\|\cost_t-M_t\|_1.
\end{equation*}
Also, since $\{\E_{t-1}[\rvcosth_t]-\rvcosth_t\}_t$ is a martingale difference sequence (MDS),
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle\right] = 0.
\end{equation*}
Thus, the second term in \eqref{eq:bandit_decomp} is bounded as
\begin{equation}
    \E\left[\sum_{t=1}^T\langle\rho_t,\cost_t-\rvcosth_t\rangle\right] \le \gamma\sum_{t=1}^T\|\cost_t-M_t\|_1. \label{eq:bandit_decomp2}
\end{equation}

Finally, since $\rho^*$ is constant with respect to $t$ and the randomness associated with $\u(T)$,
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho^*, \cost_t-\rvcosth_t\rangle\right] = \langle\rho^*, \E\left[\sum_{t=1}^T\cost_t-\rvcosth_t\right]\rangle.
\end{equation*}
By the tower property of conditional expectation and \Cref{thm:cost-prop},
\begin{align*}
    \E\left[\sum_{t=1}^T\langle\rho^*, \cost_t-\rvcosth_t\rangle\right] &= \langle\rho^*, \E\left[\sum_{t=1}^T\cost_t-\E_{t-1}[\rvcosth_t]\right]\rangle \\
    &= \sum_{x,a}\rho^*(x,a)\E\left[\sum_{t=1}^T\cost_t(x,a)-\frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}\right] \\
    &= \sum_{x,a}\rho^*(x,a)\E\left[\sum_{t=1}^T\frac{\gamma(\cost_t(x,a)- M_t(x,a))}{\rho_t(x,a)+\gamma}\right].
\end{align*}
Since $\rho_t(\cdot), \rho^*(\cdot)\ge0$, $\gamma>0$ and $M_t(x,a)\le \cost_t(x,a)$ for all $x,a$,
\begin{equation}
    \E\left[\sum_{t=1}^T\langle\rho^*, \cost_t-\rvcosth_t\rangle\right] \ge 0 \label{eq:bandit_decomp3}
\end{equation}

Applying \eqref{eq:bandit_decomp1}, \eqref{eq:bandit_decomp2} and \Cref{eq:bandit_decomp3} to \Cref{eq:bandit_decomp}:
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*, \cost_t\rangle\right]
    \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma}\sum_{t=1}^T\left\|\cost_t-M_t\right\|_2^2 + \gamma\sum_{t=1}^T\|\cost_t-M_t\|_1
\end{equation*}

If $\eta=\left(\frac{L}{\sum_{t=1}^T\left(\frac{1}{2}\|\cost_t-M_t\|_2^2 + \|\cost_t-M_t\|_1\right)}\log{\frac{|\st||\act|}{L}}\right)^{2/3}$ and $\gamma=\sqrt{\eta}$,
\begin{equation*}
    \E[\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T})] \le \left(L\log{\frac{|\st||\act|}{L}}\right)^{1/3}\left(\sum_{t=1}^T\left(\frac{1}{2}\|\cost_t-M_t\|_2^2 + \|\cost_t-M_t\|_1\right)\right)^{2/3}.
\end{equation*}
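As a sanity check on this tuning (a sketch only; the shorthands $A$ and $B$ are introduced here for illustration), write $A=L\log\frac{|\st||\act|}{L}$ and $B=\sum_{t=1}^T\left(\frac{1}{2}\|\cost_t-M_t\|_2^2+\|\cost_t-M_t\|_1\right)$. With $\gamma=\sqrt{\eta}$, the three-term bound preceding the choice of $\eta$ reads $A/\eta+\sqrt{\eta}B$, and substituting $\eta=(A/B)^{2/3}$ balances its two parts:
\begin{equation*}
    \frac{A}{\eta} = A\left(\frac{B}{A}\right)^{2/3} = A^{1/3}B^{2/3}, \qquad \sqrt{\eta}\,B = \left(\frac{A}{B}\right)^{1/3}B = A^{1/3}B^{2/3}.
\end{equation*}
Both contributions are therefore of order $A^{1/3}B^{2/3}$, which is the rate of the bound above up to a constant factor (the two equal terms add to $2A^{1/3}B^{2/3}$).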
\qedsymbol

\subsection{Proof of Theorem \ref{theorem:bandit_hp}}
\label{proof:bandit_hp}
The total regret can be decomposed as in \eqref{eq:decom}. The first term in \eqref{eq:decom} can be viewed as the regret of the proposed algorithm with full information when the sequence of cost functions is $\{\rvcosth_t\}_{t=1}^{T}$. By \Cref{theorem:full},
\begin{equation*}
    \sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2}\sum_{t=1}^T\|\rvcosth_t-M_t\|_\infty^2.
\end{equation*}
By \eqref{eq:cost-proposed},
\begin{align*}
    \|\rvcosth_t-M_t\|_\infty^2 &= \max_{x,a}\left(\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma} \ind\{(x,a) \in \bar{\u}_L(t)\}\right)^2 \\
    &= \max_{(x,a)\in \bar{\u}_L(t)}\left(\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\right)^2 \\
    &\le \max_{x,a}\left(\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\right)^2.
\end{align*}
Since $\rho_t(\cdot)\ge0$ and $\gamma>0$,
\begin{align*}
    \|\rvcosth_t-M_t\|_\infty^2 &\le \frac{\max_{x,a}\left(\cost_t(x,a)-M_t(x,a)\right)^2}{\gamma^2} \\
    &= \frac{\|\cost_t-M_t\|_\infty^2}{\gamma^2}
\end{align*}
Thus, the first term is bounded with probability one as
\begin{equation}
    \sum_{t=1}^T\langle\rho_t-\rho^*,\rvcosth_t\rangle \le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma^2}\sum_{t=1}^T\|\cost_t-M_t\|_\infty^2. \label{eq:decom1}
\end{equation}

Using \Cref{thm:cost-prop}, the second term is rewritten as
\begin{align*}
    \sum_{t=1}^T\langle\rho_t,\cost_t-\E_{t-1}[\rvcosth_t]\rangle &= \sum_{t=1}^T\sum_{x,a}\rho_t(x,a)\left(\cost_t(x,a)-\frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}\right) \\
    &= \sum_{t=1}^T\sum_{x,a}\rho_t(x,a)\left(\frac{\gamma\cost_t(x,a)-\gamma M_t(x,a)}{\rho_t(x,a)+\gamma}\right)
\end{align*}
By $M_t(x,a)\le\cost_t(x,a)$, $\rho_t(\cdot)\ge0$ and $\gamma>0$, it is bounded with probability one as
\begin{align}
    \sum_{t=1}^T\langle\rho_t,\cost_t-\E_{t-1}[\rvcosth_t]\rangle &\le \sum_{t=1}^T\sum_{x,a}\left(\gamma\cost_t(x,a)-\gamma M_t(x,a)\right) \nonumber\\
    &= \gamma\sum_{t=1}^T\|\cost_t-M_t\|_1. \label{eq:decom2}
\end{align}

By \Cref{thm:cost-prop} and \eqref{eq:cost-proposed},
\begin{align*}
    \E_{t-1}[\rvcosth_t(x,a)]-\rvcosth_t(x,a) &= \frac{\rho_t(x,a)\cost_t(x,a)+\gamma M_t(x,a)}{\rho_t(x,a)+\gamma} \\
    &\qquad- \left(\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma} \ind\{(x,a) \in \bar{\u}_L(t)\}+M_t(x,a)\right) \\
    &= \frac{(\rho_t(x,a)-\ind\{(x,a) \in \bar{\u}_L(t)\})(\cost_t(x,a)-M_t(x,a))}{\rho_t(x,a)+\gamma}.
\end{align*}
By $M_t(x,a)\le\cost_t(x,a)$, $\rho_t(\cdot)\ge0$ and $\gamma\ge0$,
\begin{equation*}
    \E_{t-1}[\rvcosth_t(x,a)]-\rvcosth_t(x,a) \le |\cost_t(x,a)-M_t(x,a)|
\end{equation*}
Thus,
\begin{align*}
    \sum_{t=1}^T\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle &\le \sum_{t=1}^T\sum_{x,a}\rho_t(x,a)|c_t(x,a)-M_t(x,a)| \le \sum_{t=1}^T\|c_t-M_t\|_1.
\end{align*}
Since $\left\{\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle\right\}_{t=1}^T$ is a martingale difference sequence, by using the Azuma–Hoeffding inequality,
\begin{align*}
    \P\left(\sum_{t=1}^T\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle\ge\epsilon\right) \le \exp\left(\frac{-\epsilon^2}{2\sum_{t=1}^T\|c_t-M_t\|_1^2}\right) = \delta.
\end{align*}
Therefore, with probability at least $1-\delta$, the third term is bounded as
\begin{equation}
   \sum_{t=1}^T\langle\rho_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle \le \sqrt{2\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2}. \label{eq:decom3}
\end{equation}
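To make the last step explicit (a short worked inversion; the shorthand $V$ is introduced here only for illustration), let $V=\sum_{t=1}^T\|c_t-M_t\|_1^2$. Setting the tail probability equal to $\delta$ and solving for $\epsilon\ge0$ gives
\begin{equation*}
    \exp\left(\frac{-\epsilon^2}{2V}\right)=\delta \iff \epsilon^2 = 2V\log\frac{1}{\delta} \iff \epsilon=\sqrt{2V\log\frac{1}{\delta}},
\end{equation*}
which is the threshold appearing in \eqref{eq:decom3}.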

\begin{lemma}\label{thm:bernstein}
    Let $\{X_t\}_{t=1}^T$ be an $\mathbb{F}$-adapted sequence with respect to the filtration $\mathbb{F}=(\mathcal{F}_t)_t$. Define $\E_t[\cdot]=\E[\cdot|\mathcal{F}_t]$ and $\mu_t=\E_{t-1}[X_t]$. Let $\{\eta_t\}_{t=1}^T$ be an $\mathbb{F}$-predictable sequence. Then, if $\eta_t\ge0$ and $\eta_t(X_t-\mu_t)\le1.79$ for all $t$, we have
    \begin{equation}
        \P\left(\sum_{t=1}^T\eta_t(X_t-\mu_t)\ge\sum_{t=1}^T\eta_t^2\E_{t-1}[X_t^2]+\log{\frac{1}{\delta}}\right) \le \delta.
    \end{equation}
\end{lemma}
\textbf{Proof.} Let $\alpha_t=\E_{t-1}[\eta_t(X_t-\E_{t-1}[X_t])^2]=\eta_t\E_{t-1}[(X_t-\E_{t-1}[X_t])^2]$ and $\mu_t=\E_{t-1}[X_t]$, where the second equality holds because $\eta_t$ is $\mathbb{F}$-predictable. By Markov's inequality,
\begin{align}
    \P\left(\sum_{t=1}^T\eta_t(X_t-\mu_t-\alpha_t) \ge \log{\frac{1}{\delta}}\right) &= \P\left(\exp\left(\sum_{t=1}^T
\eta_t(X_t-\mu_t-\alpha_t)\right) \ge \frac{1}{\delta}\right) \nonumber\\
    &\le \delta\E\left[\exp\left(\sum_{t=1}^T\eta_t(X_t-\mu_t-\alpha_t)\right)\right]. \label{eq:markov}
\end{align}
Let $Z_n=\exp\left(\sum_{t=1}^n\eta_t(X_t-\mu_t-\alpha_t)\right)$ and $y_{n+1}=\exp(\eta_{n+1}(X_{n+1}-\mu_{n+1}-\alpha_{n+1}))$. Since $Z_{n+1}=Z_n y_{n+1}$ and $Z_n$ is $\mathbb{F}$-adapted,
\begin{align*}
    \E[Z_{n+1}|\mathcal{F}_n] &= \E[Z_n y_{n+1}|\mathcal{F}_n] \\
    &= Z_n\E[y_{n+1}|\mathcal{F}_n].
\end{align*}
Since $\eta_n\alpha_n$ is $\mathbb{F}$-predictable, $\exp(x)\le1+x+x^2$ for $x\le1.79$, and $1+x\le\exp(x)$ for all $x$,
\begin{align*}
    \E_{n-1}[y_n] &= \exp(-\eta_n\alpha_n)\E_{n-1}[\exp(\eta_n(X_n-\mu_n))] \\
    &\le \exp(-\eta_n\alpha_n)\E_{n-1}[1+(\eta_n(X_n-\mu_n))+(\eta_n(X_n-\mu_n))^2] \\
    &= \exp(-\eta_n\alpha_n)(1+\eta_n\E_{n-1}[X_n]-\eta_n\mu_n+\eta_n^2\E_{n-1}[(X_n-\mu_n)^2]) \\
    &= \exp(-\eta_n\alpha_n)(1+\eta_n^2\E_{n-1}[(X_n-\mu_n)^2]) \\
    &\qquad\text{(By $\mu_n=\E_{n-1}[X_n]$)} \\
    &\le \exp(-\eta_n\alpha_n)\exp(\eta_n^2\E_{n-1}[(X_n-\mu_n)^2]) \\
    &= \exp(-\eta_n^2\E_{n-1}[(X_n-\mu_n)^2])\exp(\eta_n^2\E_{n-1}[(X_n-\mu_n)^2]) = 1 \\
    &\qquad\text{(By $\alpha_n=\eta_n\E_{n-1}[(X_n-\mu_n)^2]$)}.
\end{align*}
Therefore $Z_n$ is a supermartingale, i.e.,
\begin{equation*}
    \E_n[Z_{n+1}] = Z_n\E_n[y_{n+1}] \le Z_n.
\end{equation*}
By tower expectation,
\begin{equation*}
    \E[Z_n] = \E\left[\E_{n-1}[Z_n]\right] \le \E[Z_{n-1}] \le \dots \le \E[Z_1] = \E[y_1] \le 1.
\end{equation*}
Applying this result to \eqref{eq:markov},
\begin{equation*}
    \P\left(\sum_{t=1}^T\eta_t(X_t-\mu_t-\alpha_t) \ge \log{\frac{1}{\delta}}\right) \le \delta
\end{equation*}
Since $\alpha_t=\eta_t\E_{t-1}[(X_t-\mu_t)^2]$ and $\E_{t-1}[(X_t-\mu_t)^2]\le\E_{t-1}[X_t^2]$,
\begin{gather*}
    \P\left(\sum_{t=1}^T\eta_t(X_t-\mu_t) \ge \log{\frac{1}{\delta}}+\sum_{t=1}^T\eta_t^2\E_{t-1}[(X_t-\mu_t)^2]\right) \le \delta \\
    \P\left(\sum_{t=1}^T\eta_t(X_t-\mu_t) \ge \log{\frac{1}{\delta}}+\sum_{t=1}^T\eta_t^2\E_{t-1}[X_t^2]\right) \le \delta
\end{gather*}
\qedsymbol
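
The constant $1.79$ in \Cref{thm:bernstein} reflects the range on which $\exp(x)\le1+x+x^2$ holds. A quick numerical check (values rounded to three decimals, given only as an illustration) shows that the inequality is nearly tight at this boundary and fails slightly beyond it:
\begin{align*}
    x=1.79:&\quad \exp(1.79)\approx5.989 \le 1+1.79+1.79^2\approx5.994, \\
    x=1.80:&\quad \exp(1.80)\approx6.050 > 1+1.80+1.80^2=6.040.
\end{align*}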

To apply \Cref{thm:bernstein} to the last term in \eqref{eq:decom}, fix a layer $l$ and let $X_t=\sum_{x\in\st_l,a\in\act}\rho(x,a)[\rvcosth_t(x,a)-M_t(x,a)]$ and $\eta_t=\frac{\gamma}{\|\cost-M\|_{\st_l}}$ (here $\eta_t$ denotes the predictable sequence of \Cref{thm:bernstein}, not the learning rate $\eta$), where $\|\cost-M\|_{\st_l}=\max_{t=1,\dots,T}\max_{x\in\st_l,a\in\act}|\cost_t(x,a)-M_t(x,a)|$. Then, by \eqref{eq:cost-proposed},
\begin{align*}
    \mu_t &= \E_{t-1}\left[\sum_{x\in\st_l,a\in\act}\rho(x,a)[\rvcosth_t(x,a)-M_t(x,a)]\right] \\
    &= \E_{t-1}\left[\sum_{x\in\st_l,a\in\act}\rho(x,a)\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma} \ind\{(x,a) \in \bar{\u}_L(t)\}\right] \\
    &= \sum_{x\in\st_l,a\in\act}\rho(x,a)\frac{\rho_t(x,a)[\cost_t(x,a)-M_t(x,a)]}{\rho_t(x,a)+\gamma}
\end{align*}
By \Cref{thm:bernstein}, with probability at least $1-\delta'$,
\begin{align*}
    &\sum_{t=1}^T\frac{\gamma}{\|\cost-M\|_{\st_l}}\left(\sum_{x\in\st_l,a\in\act}\rho(x,a)\left[\rvcosth_t(x,a)-M_t(x,a)-\frac{\rho_t(x,a)[\cost_t(x,a)-M_t(x,a)]}{\rho_t(x,a)+\gamma}\right]\right) \\
    &\le \log{\frac{1}{\delta'}}+\sum_{t=1}^T\left(\frac{\gamma}{\|\cost-M\|_{\st_l}}\right)^2\E_{t-1}\left[\left(\sum_{x\in\st_l,a\in\act}\rho(x,a)[\rvcosth_t(x,a)-M_t(x,a)]\right)^2\right].
\end{align*}
By \eqref{eq:cost-proposed}, and since exactly one state-action pair in layer $l$ is visited in each episode (so the cross terms of the square vanish),
\begin{align*}
    &\sum_{t=1}^T\frac{\gamma}{\|\cost-M\|_{\st_l}}\left(\sum_{x\in\st_l,a\in\act}\rho(x,a)\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma} [\ind\{(x,a) \in \bar{\u}_L(t)\}-\rho_t(x,a)]\right) \\
    &\le \log{\frac{1}{\delta'}}+\sum_{t=1}^T\left(\frac{\gamma}{\|\cost-M\|_{\st_l}}\right)^2\sum_{x\in\st_l,a\in\act}\left(\rho(x,a)\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\right)^2\rho_t(x,a).
\end{align*}
Since $\rho_t(\cdot)\ge0$ and $\gamma>0$,
\begin{align*}
    &\frac{\gamma}{\|\cost-M\|_{\st_l}}\sum_{t=1}^T\sum_{x\in\st_l,a\in\act}\rho(x,a)\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma} [\ind\{(x,a) \in \bar{\u}_L(t)\}-\rho_t(x,a)] \\
    &\le \log{\frac{1}{\delta'}}+\frac{\gamma}{\|\cost-M\|_{\st_l}}\sum_{t=1}^T\sum_{x\in\st_l,a\in\act}\left(\rho(x,a)^2\frac{[\cost_t(x,a)-M_t(x,a)]^2}{\rho_t(x,a)+\gamma}\right)\frac{\gamma}{\|\cost-M\|_{\st_l}}.
\end{align*}
Rearranging,
\begin{align*}
    &\frac{\gamma}{\|\cost-M\|_{\st_l}}\sum_{t=1}^T\sum_{x\in\st_l,a\in\act}\rho(x,a)\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}  \\
    &\qquad\left[\ind\{(x,a) \in \bar{\u}_L(t)\}-\rho_t(x,a)-\frac{\gamma\rho(x,a)[\cost_t(x,a)-M_t(x,a)]}{\|\cost-M\|_{\st_l}}\right] \\
    &\le \log{\frac{1}{\delta'}}
\end{align*}
Since $\rho(x,a),\rho_t(x,a)\ge0$, $\gamma>0$ and $M_t(x,a)\le c_t(x,a)$, by $\rho(x,a)\le1$ and $c_t(x,a)-M_t(x,a)\le\|c-M\|_{\st_l}$,
\begin{equation*}
    \frac{\gamma}{\|\cost-M\|_{\st_l}}\sum_{t=1}^T\sum_{x\in\st_l,a\in\act}\rho(x,a)\frac{\cost_t(x,a)-M_t(x,a)}{\rho_t(x,a)+\gamma}\left[\ind\{(x,a) \in \bar{\u}_L(t)\}-\rho_t(x,a)-\gamma\right] \le \log{\frac{1}{\delta'}}
\end{equation*}
By \eqref{eq:cost-proposed},
\begin{equation*}
    \frac{\gamma}{\|\cost-M\|_{\st_l}}\sum_{t=1}^T\sum_{x\in\st_l,a\in\act}\rho(x,a)[\rvcosth_t(x,a)-\cost_t(x,a)] \le \log{\frac{1}{\delta'}}.
\end{equation*}
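The last step uses a pointwise identity, spelled out here for convenience. Writing $I_t$ for $\ind\{(x,a)\in\bar{\u}_L(t)\}$ (a shorthand used only in this display) and suppressing the argument $(x,a)$, the estimator in \eqref{eq:cost-proposed} is $\rvcosth_t=\frac{\cost_t-M_t}{\rho_t+\gamma}I_t+M_t$, so
\begin{align*}
    \frac{\cost_t-M_t}{\rho_t+\gamma}\left[I_t-\rho_t-\gamma\right] &= \frac{\cost_t-M_t}{\rho_t+\gamma}I_t - (\cost_t-M_t) \\
    &= (\rvcosth_t-M_t) - (\cost_t-M_t) = \rvcosth_t-\cost_t.
\end{align*}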
Hence, for each layer $l$, with probability at least $1-\delta'$,
\begin{equation*}
    \sum_{t=1}^T\sum_{x\in\st_l,a\in\act}\rho(x,a)[\rvcosth_t(x,a)-\cost_t(x,a)] \le \frac{\|\cost-M\|_{\st_l}}{\gamma}\log{\frac{1}{\delta'}}
\end{equation*}
By a union bound over all layers, with probability at least $1-L\delta'$,
\begin{align*}
    \sum_{t=1}^T\sum_{x,a}\rho(x,a)[\rvcosth_t(x,a)-\cost_t(x,a)] &\le \sum_{l=1}^L\frac{\|\cost-M\|_{\st_l}}{\gamma}\log{\frac{1}{\delta'}} \\
    &\le \frac{L}{\gamma}\log{\frac{1}{\delta'}}\max_{l=1,\dots,L}\|\cost-M\|_{\st_l} \\
    &= \frac{L}{\gamma}\log{\frac{1}{\delta'}}\max_{t=1,\dots,T}\|\cost_t-M_t\|_\infty
\end{align*}
Setting $\delta'=\delta/L$, with probability at least $1-\delta$, the last term in \eqref{eq:decom} is bounded as
\begin{equation}
    \sum_{t=1}^T\langle\rho^*,\rvcosth_t-c_t\rangle \le \frac{L}{\gamma}\log{\frac{L}{\delta}}\max_{t=1,\dots,T}\|c_t-M_t\|_\infty. \label{eq:decom4}
\end{equation}
Applying \eqref{eq:decom1}, \eqref{eq:decom2}, \eqref{eq:decom3} and \eqref{eq:decom4} back to \eqref{eq:decom},
\begin{align*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma^2}\sum_{t=1}^T\|\cost_t-M_t\|_\infty^2 + \gamma\sum_{t=1}^T\|\cost_t-M_t\|_1 \\
    &\qquad+ \sqrt{2\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2} + \frac{L}{\gamma}\log{\frac{L}{\delta}}\max_{t=1,\dots,T}\|c_t-M_t\|_\infty \\
    &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma^2}\sum_{t=1}^T\|\cost_t-M_t\|_\infty^2 + \gamma\sum_{t=1}^T\|\cost_t-M_t\|_1 \\
    &\qquad+ \sqrt{2\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2} + \frac{L}{\eta}\log{\frac{L}{\delta}}\max_{t=1,\dots,T}\|c_t-M_t\|_\infty \\
    &\qquad\text{(Since $\eta\le\gamma$ if $\gamma=\eta^{1/3}$ and $\eta,\gamma\le1$)}
\end{align*}
Let $\eta=\left(\frac{L\log{\frac{|\st||\act|}{L}}+L\log{\frac{L}{\delta}}\max_t\|c_t-M_t\|_\infty}{\sum_{t=1}^T\frac{\|c_t-M_t\|_\infty^2}{2}+\|c_t-M_t\|_1}\right)^{3/4}$ and $\gamma=\eta^{1/3}$.
\begin{align*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) &\le \left(L\log{\frac{|\st||\act|}{L}}+L\log{\frac{L}{\delta}}\max_t\|c_t-M_t\|_\infty\right)^{1/4} \\
    &\qquad\cdot\left(\sum_{t=1}^T\frac{\|c_t-M_t\|_\infty^2}{2}+\|c_t-M_t\|_1\right)^{3/4} \\
    &\qquad+ \sqrt{2\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2}
\end{align*}
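As a sanity check on this tuning (a sketch only; the shorthands $A'$ and $B'$ are introduced here for illustration), write $A'=L\log\frac{|\st||\act|}{L}+L\log\frac{L}{\delta}\max_t\|c_t-M_t\|_\infty$ and $B'=\sum_{t=1}^T\left(\frac{\|c_t-M_t\|_\infty^2}{2}+\|c_t-M_t\|_1\right)$. With $\gamma=\eta^{1/3}$ we have $\frac{\eta}{2\gamma^2}=\frac{\eta^{1/3}}{2}$, so the $\eta$-dependent part of the bound is $A'/\eta+\eta^{1/3}B'$, and $\eta=(A'/B')^{3/4}$ balances its two parts:
\begin{equation*}
    \frac{A'}{\eta} = A'\left(\frac{B'}{A'}\right)^{3/4} = A'^{1/4}B'^{3/4}, \qquad \eta^{1/3}B' = \left(\frac{A'}{B'}\right)^{1/4}B' = A'^{1/4}B'^{3/4}.
\end{equation*}
Both contributions are therefore of order $A'^{1/4}B'^{3/4}$, which is the rate of the first term in the bound above up to a constant factor.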
\qedsymbol

\subsection{Proof of Theorem \ref{theorem:doubling_exp}}
\label{proof:doubling_exp}
The expected total regret can be decomposed into local regrets of each phase as below, where $N>1$ denotes the number of phases that $T$ episodes are broken into.
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*, \cost_t\rangle\right] = \E\left[\sum_{i=1}^N\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*,\cost_t\rangle\right]
\end{equation*}
Since the randomness of the future does not affect the past,
\begin{align}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*, \cost_t\rangle\right] &= \sum_{i=1}^N \E_{\u(s_{i+1}-1)}\left[\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t,\cost_t\rangle-\sum_{t=s_i}^{s_{i+1}-1}\langle\rho^*,\cost_t\rangle\right] \nonumber\\
    &\le \sum_{i=1}^N \E_{\u(s_{i+1}-1)}\left[\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t,\cost_t\rangle-\min_{\rho\in\Delta(\mdp)}\sum_{t=s_i}^{s_{i+1}-1}\langle\rho,\cost_t\rangle\right]. \label{eq:phase_decomp}
\end{align}
We can now bound each phase separately, where the local regret of each phase is measured against its local optimum $\rho^*_i=\arg\min_{\rho\in\Delta(\mdp)}\sum_{t=s_i}^{s_{i+1}-1}\langle\rho,\cost_t\rangle$. If $\rho_{t+1}$ is the solution of \eqref{eq:OREPS-OPIX} with $\eta=\eta_i$ for $t=s_i,\dots,s_{i+1}-1$, then by \Cref{theorem:bandit_exp},
\begin{equation*}
    \E\left[\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*_i,\cost_t\rangle\right] \le \frac{L}{\eta_i}\log\frac{|\st||\act|}{L} + \frac{\eta_i}{2\gamma_i}\sum_{t=s_i}^{s_{i+1}-1}\left\|\cost_t-M_t\right\|_2^2 + \sum_{t=s_i}^{s_{i+1}-1}\gamma_i\|\cost_t-M_t\|_1.
\end{equation*}
Note that the term $\frac{L}{\eta_i}\log\frac{|\st||\act|}{L}$ represents the initial suboptimality assuming that $\rho_{s_i}$ is initialized as the uniform distribution. However, the logic behind regularizing with the Bregman divergence (between the current and past occupancy measures) is that the occupancy measure learned in one episode will suffer a lower cost in the next episode than a random initialization would. Therefore the bound remains valid, conservatively, when $\rho_{s_i}$ is carried over from the previous phase instead of being re-initialized in every phase.

By \Cref{alg:doubling}, use $\gamma_i=\sqrt{\eta_i}$.
\begin{align*}
    \E\left[\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*_i,\cost_t\rangle\right] &\le \frac{L}{\eta_i}\log\frac{|\st||\act|}{L} + \frac{\eta_i}{2\gamma_i}\sum_{t=s_i}^{s_{i+1}-1}\left\|\cost_t-M_t\right\|_2^2 + \sum_{t=s_i}^{s_{i+1}-1}\gamma_i\|\cost_t-M_t\|_1 \\
    &= \frac{L}{\eta_i}\log\frac{|\st||\act|}{L} + \sqrt{\eta_i}\sum_{t=s_i}^{s_{i+1}-1}\left\{\frac{1}{2}\left\|\cost_t-M_t\right\|_2^2 + \|\cost_t-M_t\|_1\right\}.
\end{align*}
\textcolor{black}{Since $\E\left[|\bar{c}_t(x,a)-M_t(x,a)|\right] = |c_t(x,a)-M_t(x,a)|$ with respect to the randomness of the trajectory $\bar{\u}_L(t)$,}
\begin{equation*}
\color{black}
    \E\left[\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*_i,\cost_t\rangle\right] \le \frac{L}{\eta_i}\log\frac{|\st||\act|}{L} + \sqrt{\eta_i}\E\left[\sum_{t=s_i}^{s_{i+1}-1}\left\{\frac{1}{2}\left\|\bar{c}_t-M_t\right\|_2^2 + \|\bar{c}_t-M_t\|_1\right\}\right].
\end{equation*}

By step 4 of \Cref{alg:doubling}, $\eta_i^{-1}D_0\ge\sqrt{\eta_i}\Psi_{s_i:s_{i+1}-1}$ for all $i=1,\dots,N$.
\begin{equation*}
    \E\left[\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*_i,\cost_t\rangle\right] \le \frac{2L}{\eta_i}\log\frac{|\st||\act|}{L}
\end{equation*}
Adding over $N$ phases, by \Cref{eq:phase_decomp},
\begin{equation*}
    \E\left[\sum_{t=1}^T\langle\rho_t-\rho^*, \cost_t\rangle\right] \le L\log\frac{|\st||\act|}{L}\left(2\sum_{i=1}^N \frac{1}{\eta_i} \right) = D_0\left(2\sum_{i=1}^N \frac{1}{\eta_i} \right).
\end{equation*}
From $\sum_{i=1}^N(1/2)^i\le1$ and $\eta_{N-1}=\eta_0/2^{N-1}$,
\begin{equation}
    2\sum_{i=1}^N\frac{1}{\eta_i} = \frac{2}{\eta_0}\sum_{i=1}^N2^i = \frac{2}{\eta_0}2^{N+1}\sum_{i=1}^N2^{i-N-1} = \frac{2^{N+2}}{\eta_0}\sum_{i=1}^N(\frac{1}{2})^i \le \frac{2^{N+2}}{\eta_0}=\frac{8}{\eta_{N-1}}. \label{eq:eta_sum}
\end{equation}
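For a concrete instance of \eqref{eq:eta_sum} (an illustrative example, assuming $\eta_i=\eta_0/2^i$ as used in the first equality), take $N=3$, so that $\eta_{N-1}=\eta_2=\eta_0/4$:
\begin{equation*}
    2\sum_{i=1}^3\frac{1}{\eta_i} = \frac{2}{\eta_0}(2+4+8) = \frac{28}{\eta_0} \le \frac{32}{\eta_0} = \frac{2^{5}}{\eta_0} = \frac{8}{\eta_2}.
\end{equation*}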
By $\sqrt{\eta_{N-1}}\Psi_{s_{N-1}:s_N}>\eta_{N-1}^{-1}D_0\ge\sqrt{\eta_{N-1}}\Psi_{s_{N-1}:s_N-1}$ and the monotonicity of $\Psi$,
\begin{gather*}
    \eta_{N-1}^{-3/2} < \frac{\Psi_{s_{N-1}:s_N}}{D_0} \le \frac{\Psi_{1:T}}{D_0}.
\end{gather*}
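In more detail (a short expansion of the step above), dividing both sides of $\sqrt{\eta_{N-1}}\Psi_{s_{N-1}:s_N}>\eta_{N-1}^{-1}D_0$ by $\sqrt{\eta_{N-1}}D_0>0$ gives the first inequality, and raising both sides to the power $2/3$ yields the form used next:
\begin{equation*}
    \frac{1}{\eta_{N-1}} = \left(\eta_{N-1}^{-3/2}\right)^{2/3} < \left(\frac{\Psi_{s_{N-1}:s_N}}{D_0}\right)^{2/3} \le \left(\frac{\Psi_{1:T}}{D_0}\right)^{2/3}.
\end{equation*}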
Then,
\begin{align*}
    \E[\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T})] &\le D_0\frac{8}{\eta_{N-1}} \\
    &< 8D_0\left(\frac{\Psi_{1:T}}{D_0}\right)^{2/3} \\
    &= 8D_0^{1/3}\Psi_{1:T}^{2/3} \\
    &= \textcolor{black}{8\left(L\log\frac{|\st||\act|}{L}\right)^{1/3}\left(\sum_{t=1}^{T}\frac{1}{2}\|\bar{c}_t-M_t\|_2^2 + \|\bar{c}_t-M_t\|_1\right)^{2/3}}.
\end{align*}
\textcolor{black}{Taking expectation over the randomness of the trajectory $\bar{\u}_L(T)$ and applying Jensen's inequality to the concave map $z\mapsto z^{2/3}$,}
\begin{equation*}
\color{black}
    \E[\reg_T(\rho^*,\{\cost_t\}_{t=1}^{T})] \le 8\left(L\log\frac{|\st||\act|}{L}\right)^{1/3}\left(\sum_{t=1}^{T}\frac{1}{2}\|\cost_t-M_t\|_2^2 + \|\cost_t-M_t\|_1\right)^{2/3}.
\end{equation*}
Note that checking whether $\eta_i^{-1}D_0<\sqrt{\eta_i}\Psi_{s_i:t}$ in each episode $t$ does not incur additional cost. As in \Cref{alg:doubling}, $\eta_i$ and $\gamma_i$ can be determined after the entire rollout of episode $t$, since they are only needed for computing $\hat{c}_t$, $M_{t+1}$ and $\rho_{t+1}$. Therefore this is the final bound, unlike Lemma 12 of \cite{rakhlin2013online}, which suffers additional cost for finding $\Psi$.
\qedsymbol

\subsection{Proof of Theorem \ref{theorem:doubling_full}}
The proof is similar to \Cref{proof:doubling_exp} but simpler. Again, the total regret can be decomposed into local regrets of each phase.
\begin{align*}
    \sum_{t=1}^T \langle c_t,\rho_t-\rho^*\rangle &= \sum_{i=1}^N\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*,\cost_t\rangle \\
    &\le \sum_{i=1}^N\sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t,\cost_t\rangle-\min_{\rho\in\Delta(\mdp)}\sum_{t=s_i}^{s_{i+1}-1}\langle\rho,\cost_t\rangle
\end{align*}
Now the upper bound is the problem of optimizing \eqref{eq:OREPS-OPIX} with $\eta=\eta_i$ and $t=s_i,\dots,s_{i+1}-1$. By \Cref{theorem:full},
\begin{equation*}
    \sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*_i,\cost_t\rangle \le \frac{L}{\eta_i}\log\frac{|\st||\act|}{L} + \sum_{t=s_i}^{s_{i+1}-1}\frac{\eta_i}{2}\|\cost_t-M_t\|_\infty^2,
\end{equation*}
where $\rho^*_i=\arg\min_{\rho\in\Delta(\mdp)}\sum_{t=s_i}^{s_{i+1}-1}\langle\rho,\cost_t\rangle$.

By $\eta_i^{-1}D_0\ge\eta_i\Psi_{s_i:s_{i+1}-1}$ according to our doubling trick algorithm,
\begin{equation*}
    \sum_{t=s_i}^{s_{i+1}-1}\langle\rho_t-\rho^*_i,\cost_t\rangle \le \frac{2L}{\eta_i}\log\frac{|\st||\act|}{L}.
\end{equation*}
Adding over $N$ phases,
\begin{equation*}
    \sum_{t=1}^T \langle c_t,\rho_t-\rho^*\rangle \le L\log\frac{|\st||\act|}{L} \left(2\sum_{i=1}^N\frac{1}{\eta_i}\right)
\end{equation*}
By \eqref{eq:eta_sum}, $\eta_{N-1}^{-1}D_0 < \eta_{N-1}\Psi_{s_{N-1}:s_N}$ and the monotonicity of $\Psi$,
\begin{align*}
    \reg_T(\rho^*,\{\cost_t\}_{t=1}^{T}) &\le D_0\frac{8}{\eta_{N-1}} \\
    &< 8D_0\left(\frac{\Psi_{1:T}}{D_0}\right)^{1/2} \\
    &= 8D_0^{1/2}\Psi_{1:T}^{1/2} \\
    &= 8\sqrt{\frac{L}{2}\log\frac{|\st||\act|}{L}\sum_{t=1}^T\|c_t-M_t\|_\infty^2}
\end{align*}
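The second line can be expanded as follows (a sketch, assuming $\Psi_{1:T}=\sum_{t=1}^T\frac{1}{2}\|c_t-M_t\|_\infty^2$, which is what the last equality above implies). Dividing $\eta_{N-1}^{-1}D_0<\eta_{N-1}\Psi_{s_{N-1}:s_N}$ by $\eta_{N-1}D_0>0$ and taking square roots,
\begin{equation*}
    \eta_{N-1}^{-2} < \frac{\Psi_{s_{N-1}:s_N}}{D_0} \le \frac{\Psi_{1:T}}{D_0} \quad\Longrightarrow\quad \frac{1}{\eta_{N-1}} < \left(\frac{\Psi_{1:T}}{D_0}\right)^{1/2},
\end{equation*}
so the exponent $2/3$ of the bandit case is replaced by $1/2$ in the full-information case.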
\qedsymbol


\subsection{Proof of Theorem \ref{theorem:unknown_transition_hp}}
Since the constraint set of occupancy measures is unknown, the regret of OREPS-OPIX in the unknown-transition setting can be decomposed as in \eqref{eq:uc-decomp}. Note the additional error term $\rho_t-\hat{\rho}_t$ compared with \eqref{eq:decom}, which was used in the analysis of \Cref{theorem:bandit_hp}.
\begin{equation}
    \mathcal{R}_T(\rho^*,\{c_t\}_{t=1}^T) = \sum_{t=1}^T\left[\langle \rho_t-\hat{\rho}_t,c_t\rangle+\langle\hat{\rho}_t,c_t-\rvcosth_t\rangle+\langle\hat{\rho}_t-\rho^*,\rvcosth_t\rangle+\langle \rho^*,\rvcosth_t-c_t\rangle\right]. \label{eq:uc-decomp}
\end{equation}

\begin{lemma}[Lemma 5 of \citet{jin2020learning}]
\label{lemma:uob-oreps-error}
    With probability at least $1-6\delta$, for $\hat{\rho}_t$ estimated with $\rho^{P,\pi_t}$ under transition probability $P\in\mathcal{P}$,
    where the confidence set $\mathcal{P}$ is defined as \eqref{eq:confidence-set},
    \begin{equation*}
        \sum_{t=1}^T\langle \rho_t-\hat{\rho}_t,c_t\rangle = \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}\right).
    \end{equation*}
\end{lemma}

By \Cref{lemma:uob-oreps-error}, with probability at least $1-6\delta$, the first term is bounded as
\begin{equation}
    \sum_{t=1}^T\langle \rho_t-\hat{\rho}_t,c_t\rangle = \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}\right).
    \label{eq:uc-decomp1}
\end{equation}

The second term can be decomposed further as
\begin{equation*}
    \sum_{t=1}^T\langle\hat{\rho}_t,c_t-\rvcosth_t\rangle = \sum_{t=1}^T\langle\hat{\rho}_t,c_t-\E_{t-1}[\rvcosth_t]\rangle + \sum_{t=1}^T\langle\hat{\rho}_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle.
\end{equation*}
From the definition of our cost estimator with the upper confidence bound,
\begin{align*}
    \sum_{t=1}^T\langle\hat{\rho}_t,c_t-\E_{t-1}[\rvcosth_t]\rangle &= \sum_{t=1}^T\sum_{x,a}\hat{\rho}_t(x,a)\left[c_t(x,a)-\frac{c_t(x,a)-M_t(x,a)}{u_t(x,a)+\gamma}\rho_t(x,a)-M_t(x,a)\right] \\
    &= \sum_{t=1}^T\sum_{x,a}\frac{\hat{\rho}_t(x,a)(c_t(x,a)-M_t(x,a))}{u_t(x,a)+\gamma}(u_t(x,a)+\gamma-\rho_t(x,a)).
\end{align*}
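The algebra behind the second equality is spelled out below (a sketch, assuming the estimator takes the form $\rvcosth_t=\frac{c_t-M_t}{u_t+\gamma}\ind\{(x,a)\in\bar{\u}_L(t)\}+M_t$, as the first line above suggests, and that the indicator has conditional mean $\rho_t(x,a)$). Suppressing the argument $(x,a)$,
\begin{align*}
    c_t-\E_{t-1}[\rvcosth_t] &= (c_t-M_t)-\frac{(c_t-M_t)\rho_t}{u_t+\gamma} \\
    &= (c_t-M_t)\,\frac{u_t+\gamma-\rho_t}{u_t+\gamma}.
\end{align*}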
By $M_t(x,a)\le c_t(x,a) \le M_t(x,a)+1$, $\hat{\rho}_t(\cdot)\ge0$ and $u_t(x,a)\ge\hat{\rho}_t(x,a)$,
\begin{align*}
    \sum_{t=1}^T\langle\hat{\rho}_t,c_t-\E_{t-1}[\rvcosth_t]\rangle &\le \sum_{t=1}^T\sum_{x,a}(c_t(x,a)-M_t(x,a))(u_t(x,a)+\gamma-\rho_t(x,a)) \nonumber\\
    &\le \sum_{t=1}^T\sum_{x,a}|u_t(x,a)-\rho_t(x,a)|+(c_t(x,a)-M_t(x,a))\gamma \nonumber\\
    &= \sum_{t=1}^T\sum_{x,a}|u_t(x,a)-\rho_t(x,a)|+\gamma\sum_{t=1}^T\|c_t-M_t\|_1.
\end{align*}

\begin{lemma}[Lemma 4 of \citet{jin2020learning}]
\label{lemma:occupancy-error}
    With probability at least $1-6\delta$, for transition functions $P_t^x\in\mathcal{P}$ for all states $x\in\st$, where the confidence set $\mathcal{P}$ is defined as in \eqref{eq:confidence-set}, the cumulative error of the occupancy measure with respect to $\rho_t$ of the known-transition setting is bounded as
    \begin{equation*}
        \begin{aligned}
            &\sum_{t=1}^T\sum_{x\in\st,a\in\act}\left|\rho^{P_t^x,\pi_t}(x,a)-\rho_t(x,a)\right| \le \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}\right)
        \end{aligned}
    \end{equation*}
\end{lemma}
Since $u_t(x,a)=\max_{P\in\mathcal{P}}\rho^{P,\pi_t}(x,a)$, by \Cref{lemma:occupancy-error}, with probability at least $1-6\delta$,
\begin{equation*}
    \sum_{t=1}^T\langle\hat{\rho}_t,c_t-\E_{t-1}[\rvcosth_t]\rangle \le \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}\right) + \gamma\sum_{t=1}^T\|c_t-M_t\|_1.
\end{equation*}
Since $\left\{\langle\hat{\rho}_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle\right\}_{t=1}^T$ is a martingale difference sequence, by using the Azuma–Hoeffding inequality, with probability at least $1-\delta$,
\begin{equation*}
   \sum_{t=1}^T\langle\hat{\rho}_t,\E_{t-1}[\rvcosth_t]-\rvcosth_t\rangle \le \sqrt{2\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2}.
\end{equation*}
With probability at least $1-7\delta$, the second term is bounded as
\begin{equation}
    \sum_{t=1}^T\langle\hat{\rho}_t,c_t-\rvcosth_t\rangle \le \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}+ \sqrt{\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2}\right) + \gamma\sum_{t=1}^T\|c_t-M_t\|_1 \label{eq:uc-decomp2}
\end{equation}

Since $\hat{\rho}_t$ optimizes for $\rvcosth_t$, from the analyses of Theorems 1 and 3, the third term is bounded as
\begin{align}
    \sum_{t=1}^T\langle\hat{\rho}_t-\rho^*,\rvcosth_t\rangle &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \sum_{t=1}^T\frac{\eta}{2}\|\rvcosth_t-M_t\|_\infty^2 \nonumber\\
    &\le \frac{L}{\eta}\log\frac{|\st||\act|}{L} + \frac{\eta}{2\gamma^2}\sum_{t=1}^T\|c_t-M_t\|_\infty^2. \label{eq:uc-decomp3}
\end{align}

Since $u_t(x,a) \ge \rho_t(x,a)$, from the analysis of Theorem 3 using Lemma 2, the fourth term is bounded as
\begin{equation}
    \sum_{t=1}^T\langle \rho^*,\rvcosth_t-c_t\rangle \le \frac{L}{\gamma}\log{\frac{L}{\delta}}\max_{t=1,\dots,T}\|c_t-M_t\|_\infty. \label{eq:uc-decomp4}
\end{equation}

Finally, applying \eqref{eq:uc-decomp1}, \eqref{eq:uc-decomp2}, \eqref{eq:uc-decomp3} and \eqref{eq:uc-decomp4} to \eqref{eq:uc-decomp} and letting $\gamma=\eta^{1/3}$, with probability at least $1-7\delta$,
\begin{align*}
    \mathcal{R}_T(\rho^*,\{c_t\}_{t=1}^T) &\le \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}+ \sqrt{\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2}\right) \\
    &\qquad+ \eta^{1/3}\sum_{t=1}^T\left[\frac{\|c_t-M_t\|_\infty^2}{2}+\|c_t-M_t\|_1\right] \\
    &\qquad+ \frac{L}{\eta}\left[\log\frac{|\st||\act|}{L} + \log{\frac{L}{\delta}}\max_{t=1,\dots,T}\|c_t-M_t\|_\infty\right]
\end{align*}

Let $\eta=\left(\frac{L\log{\frac{|\st||\act|}{L}}+L\log{\frac{L}{\delta}}\max_t\|c_t-M_t\|_\infty}{\sum_{t=1}^T\frac{\|c_t-M_t\|_\infty^2}{2}+\|c_t-M_t\|_1}\right)^{3/4}$,
\begin{align*}
    \mathcal{R}_T(\rho^*,\{c_t\}_{t=1}^{T}) &\le \left(L\log{\frac{|\st||\act|}{L}}+L\log{\frac{L}{\delta}}\max_t\|c_t-M_t\|_\infty\right)^{1/4} \\
    &\qquad\cdot\left(\sum_{t=1}^T\frac{\|c_t-M_t\|_\infty^2}{2}+\|c_t-M_t\|_1\right)^{3/4} \\
    &\qquad+ \O\left(L|\st|\sqrt{|\act|T\log\left(\frac{T|\st||\act|}{\delta}\right)}+ \sqrt{\log{\frac{1}{\delta}}\sum_{t=1}^T\|c_t-M_t\|_1^2}\right)
\end{align*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Experiments}
\subsection{Experimental details}
The parameters of the experiment in \Cref{sec:exp} are listed in \Cref{table:main_exp}. Additionally, we set $t_m=1000$ episodes between changes of obstacle locations for better predictability. The agent's starting location was randomly assigned at the beginning of each episode, and the goal location was fixed across episodes. All three obstacles moved randomly every $t_m$ episodes, but in a restricted manner so that they never blocked every path from the starting point to the goal: that is, there was always an obstacle-free path from the start to the goal. Lastly, the experiment was repeated ten times, and the mean and variance over the ten repetitions are shown in \Cref{fig:experiment_oreps}.

\begin{table}[hbp]
  \caption{Parameters used in the experiments}
  \label{table:main_exp}
  \centering
  \begin{tabular}{lll}
    \toprule
    Parameter   & Description   & Value \\
    \midrule
    $\epsilon$  & Default cost  & 0.01     \\
    $L$         & Timeout (number of layers) & 200      \\
    $\eta_\text{OREPS}$ & Learning rate for OREPS and OREPS-IX       & $2.1\times10^{-3}$  \\
    $\eta_\text{OREPS-OPIX}$    & Learning rate for OREPS-OPIX  & $0.2$ \\
    \bottomrule
  \end{tabular}
\end{table}

Following \cite{zimin2013online}, the learning rate for OREPS and OREPS-IX was set to $\eta_\text{OREPS}=\sqrt{L\frac{\log\frac{|\st||\act|}{L}}{T|\st||\act|}}$. However, since the perfect predictor we used for OREPS-OPIX has zero cost-estimation error, we can set an arbitrarily high learning rate as long as it is less than 1 (for the high-probability guarantee in \Cref{theorem:bandit_hp}). After a sparse exploration of parameters, we chose $\eta_\text{OREPS-OPIX}=0.2$. With a higher learning rate, the algorithm converges even faster at the cost of higher variance. The same learning rate was used for OREPS-OPIX with the latest predictors.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\end{document}
