%\documentclass{uai2023} % for initial submission
 \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{amsfonts}

\usepackage[ruled,vlined]{algorithm2e}
%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{assumption}[theorem]{Assumption}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{remark}[theorem]{Remark}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\title{Learning in Online MDPs: \\Is there a Price for Handling the Communicating Case?}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<gautamc@cs.utexas.edu>?Subject=Your UAI 2023 paper}{Gautam Chandrasekaran}}
\author[2]{Ambuj Tewari}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science\\
    University of Texas at Austin\\
    Austin, Texas, USA
}
\affil[2]{%
    Department of Statistics\\
    University of Michigan\\
    Ann Arbor, Michigan, USA}
  
  \begin{document}
\maketitle

\begin{abstract}
  It is a remarkable fact that the same $O(\sqrt{T})$ regret rate can be achieved in both the Experts Problem and the Adversarial Multi-Armed Bandit problem, albeit with a worse dependence on the number of actions in the latter case. In contrast, it has been shown that handling online MDPs with communicating structure and bandit information incurs $\Omega(T^{2/3})$ regret even in the case of deterministic transitions. Is this the price we pay for handling communicating structure, or is it because we also have bandit feedback? In this paper, we show that with full information, online MDPs can still be learned at an $O(\sqrt{T})$ rate even in the presence of communicating structure. We first show this by proposing an efficient follow the perturbed leader (FPL) algorithm for the deterministic transition case. We then extend our scope to consider stochastic transitions, where we first give an inefficient $O(\sqrt{T})$-regret algorithm (with a mild additional condition on the dynamics). Then we show how to achieve an $O\left(\sqrt{\frac{T}{\alpha}}\right)$ regret rate using an oracle-efficient algorithm, but with the additional restriction that the starting state distribution has mass at least $\alpha$ on each state.
\end{abstract}

\section{Introduction}\label{sec:intro}

In this work, we study online learning in Markov Decision Processes. In this setting, we have an \textit{agent} interacting with an adversarial \textit{environment}. The agent observes the state of the environment and takes an action. The action incurs an associated loss and the environment moves to a new state. The state transition dynamics are assumed to be Markovian, i.e., the probability distribution of the new state is fully determined by the action and the old state. The transition dynamics are fixed and known to the learner in advance. However, the losses are chosen by the adversary. The adversary is assumed to be oblivious (the entire loss sequence is chosen before the interaction begins). We assume that the environment reveals {\it full information} about the losses at a given time step to the agent after the corresponding action is taken. The total loss incurred by the agent is the sum of losses incurred in each step of the interaction. We denote the set of states by $S$ and the set of actions by $A$. The objective of the agent is to minimize its total loss. 



 This setting was first studied in the seminal work of \citet{10.2307/40538442}. They studied the restricted class of \textit{ergodic} MDPs, where every policy induces a Markov chain with a single recurrent class. They designed an efficient algorithm (running in time polynomial in the MDP parameters and the time of interaction) that achieved $O(\sqrt{T})$ regret with respect to the best stationary policy in hindsight. They assumed full information of the losses and that the MDP dynamics were known beforehand. This work was extended to {\it bandit feedback} by \citet{DBLP:journals/tac/NeuGSA14}\footnote[1]{with an additional assumption on the minimum stationary probability mass in any state}. They also achieved a regret bound of $O(\sqrt{T})$. Bandit feedback is a harder model in which the learner only receives information corresponding to the losses of the actions it takes.


\begin{table}[]

\begin{tabular}{l|lll}
 & \multicolumn{1}{l}{No state} & \multicolumn{1}{l}{Ergodic} & \multicolumn{1}{l}{Communicating} \\ \hline
\multicolumn{1}{l|}{Full Info} & $\sqrt{T}$ & $\sqrt{T}$ & $\sqrt{T}$\textbf{\hspace{4pt} \fbox{\tiny THIS PAPER}\vspace{2pt}} \\ \hline
\multicolumn{1}{l|}{Bandit Info} & $\sqrt{T}$ & $\sqrt{T}$\footnotemark[1] & $T^{2/3}$
 \\ \hline
\end{tabular}
\caption{The dependence on time horizon $T$ of the optimal regret, under full and bandit feedback, as the state transition dynamics become more complex.}

\label{tab:rates}
\end{table}

 In this paper, we look at the more general class of {\em communicating} MDPs, where, for any pair of states, there is a policy such that the time it takes to reach the second state from the first has finite expectation. In the case of bandit feedback with deterministic transitions, \citet{10.5555/3042817.3043012} designed an algorithm that achieved $O\left(T^{2/3}\right)$ regret. This regret bound was proved to be tight by a matching lower bound in \citet{switch}. This regret lower bound was proved by a reduction from the problem of adversarial multi-armed bandits with \textit{switching costs}. In this setting, the agent incurs an additional cost every time it switches the arm it plays. Their lower bound proves that, in the case of bandit information, online learning over communicating MDPs is {\em statistically harder} than the adversarial multi-armed bandit problem, for which we have $\tilde{O}(\sqrt{T})$ regret algorithms \citep{EXP3}. Their result gives rise to a natural question: is the high regret due to the communicating structure, the bandit feedback, or both? In the case of experts with switching cost, we know $O(\sqrt{T})$ regret algorithms such as FPL \citep{KALAI2005291}. Using this, we give an $O(\sqrt{T})$ regret algorithm for online learning in communicating MDPs with full information. Thus, we show that having communicating structure alone does not add any statistical price (see Table~\ref{tab:rates}). 
 
\subsection{Our Contributions}
In this paper, we show that online learning over MDPs with full information is {\em not statistically harder\footnote[2]{up to polynomial factors in the number of states and actions}\footnote[3]{assuming the existence of a state with a ``do nothing'' action}} than the problem of online learning with expert advice. In particular, we design an efficient algorithm that learns to act in communicating Adversarial Deterministic MDPs (ADMDPs) with $O(\sqrt{T})$ regret under full information feedback against the best deterministic policy in hindsight. This is the first $O(\sqrt{T})$ regret algorithm for this problem. To achieve this bound, we design a follow the perturbed leader \citep{KALAI2005291} style algorithm that achieves low regret with respect to the set of exponentially many policies, with the additional guarantee that it does not switch policies too often. Since the number of policies is exponential, a naive implementation of FPL will not work. We carefully choose a polynomial number of perturbations (as opposed to the exponential number in naive FPL) such that the algorithm still works. We believe this is one of the sources of technical novelty in our paper.


We prove a matching\footnotemark[2] regret lower bound in this setting. We also extend the techniques used in the previous algorithm to design an algorithm that runs in time exponential in the MDP parameters and achieves $O(\sqrt{T})$ regret in the general class of communicating MDPs (albeit with an additional mild assumption\footnotemark[3]). Again, this is the first algorithm that achieves $O(\sqrt{T})$ regret against this large class of MDPs. Before this, $O(\sqrt{T})$ regret algorithms were only known for the case of ergodic MDPs. En route to this, we design the Switch\_Policy procedure (Algorithm~{\ref{alg:alpha}}) to catch up with the state distribution induced by a new policy in time $O(D^2)$, where $D$ is the diameter of the MDP. This procedure was subsequently used by \citet{ftpl_mdp_bandit}, who prove results analogous to ours in the bandit case. 

Finally, we study the problem of designing oracle-efficient algorithms. We give an $O\left(\sqrt{\frac{T}{\alpha}}\right)$ regret algorithm for communicating MDPs with a  start state distribution having probability mass at least $\alpha$ on each state that is efficient when given access to an optimization oracle.
%An MDP is \textit{ergodic} if every policy induces a Markov chain with a single recurrent class. This implies that every policy has a well defined stationary distribution \ambuj{I guess with aperiodicity also assumed}. This also implies the existence of a parameter $\tau>0$ which can be interpreted as a measure of the time before any policy starts receiving its average reward. The parameter $\tau$ is called the \textit{mixing time} of the MDP. The regret of both the results mentioned earlier have a polynomial dependence on this mixing time parameter. This assumption of \textit{uniform mixing} is quite restrictive. A weaker assumption would be to consider the class of \textit{communicating} MDPs. In communicating MDPs, for any pair of states, there exists a policy that takes the first state to the second with non-zero probability.
%\ambuj{I think this para should be expanded and moved into preliminaries. In particular, we should clarify the distinction between ergodic and uniform mixing assumptions. Note that, as discussed, ergodic immediately implies uniform mixing over all deterministic stationary policies (simply because there are only finitely many such policies). Whether ergodic implies uniform mixing over over all {\em stochastic} stationary policies (an infinite set) is not so clear.}

\section{Related Work}
% REDUNDANT: REPEAT FROM INTROThe problem we study in this paper is commonly referred to in literature as online learning in MDPs over an infinite horizon. This problem was first studied for MDPs with an ergodic transition structure. \citet{10.2307/40538442} and \citet{DBLP:journals/tac/NeuGSA14} studied this problem under full information and partial information respectively. The former achieved $O(\sqrt{T})$ regret for all ergodic MDPs, whereas the latter achieved $O(\sqrt{T})$ regret for ergodic MDPs satisfying an additional assumption\footnotemark[1]. The problem of online learning in deterministic communicating MDPs(ADMDPs) was studied by \citet{10.5555/3020652.3020666} and \cite{10.5555/3042817.3043012}. They consider bandit feedback and achieve $O(T^{3/4})$ and $O(T^{2/3})$ respectively.

As mentioned in the introduction, a closely related problem is that of online learning with switching costs. In the case of full information, algorithms like FPL \citep{KALAI2005291} achieve $O(\sqrt{T})$ regret with switching cost. In the case of bandit feedback, \citet{switching_ub} give an algorithm that achieves $O(T^{2/3})$ regret with switching cost. This was shown to be tight by \citet{switch}, who proved a matching lower bound.

Another related problem is that of designing oracle-efficient algorithms \citep{dudik,block,Haghtalab}. Designing oracle-efficient algorithms is challenging since all the main computational steps in the algorithm need to be in the form of oracle calls, and this restricts the design space of algorithms.


Subsequent to the release of an earlier version of this paper, \citet{ftpl_mdp_bandit} gave an inefficient $O(T^{2/3})$ and an oracle-efficient $O(T^{5/6})$ regret algorithm for online learning over Communicating\footnotemark[2] MDPs with bandit information. Their algorithms use our Switch\_Policy procedure and thus require the same assumption\footnotemark[3] as ours.

\section{Preliminaries}
An Online Markov Decision Process consists of a state space $S$, an action space $A$, a transition probability matrix $P$ where $P(s,a,s')$ is the probability of moving from state $s$ to $s'$ on action $a$, and a sequence of loss functions (chosen by an oblivious adversary) $\ell_1,\ldots,\ell_T$ where each $\ell_t$ is a map from $S \times A$ to $[0,1]$. In this paper, $S$ and $A$ will be finite sets.

In the case of Adversarial Deterministic MDPs (ADMDPs), the transitions are deterministic and hence the ADMDP can also be represented by a directed graph $G$ with vertices corresponding to states $S$. The edges are labelled by the actions. An edge from $s$ to $s'$ labelled by action $a$ exists in the graph when the ADMDP takes the state $s$ to state $s'$ on action $a$. This edge will be referred to as $(s,a,s')$. 

A (stationary) policy $\pi$ is a mapping $\pi:S\times A\to[0,1]$ where  $\pi(s,a)$ denotes the probability of taking action $a$ when in state $s$. When the policy is deterministic, we overload the notation and define $\pi(s)$ to be the action taken when the state is $s$.
The interaction starts in an arbitrary start state $s_1 \in S$.

An algorithm $\mathcal{A}$ that interacts with the online MDP chooses the action to be taken at each time step. It maintains a probability distribution over actions denoted by $\mathcal{A}(\cdot\mid s,\ell_1,\ldots,\ell_{t-1})$, which depends on the current state and the sequence of loss functions seen so far. 
The expected loss of the algorithm $\mathcal{A}$ is 
$$L(\mathcal{A})=\mathbb{E}\left[\sum_{t=1}^{T}{\ell_{t}(s_t,a_t)}\right]$$ where $a_t\sim \mathcal{A}\left(\cdot\mid s_t,\ell_1,\ldots,\ell_{t-1}\right)$ and $s_{t+1} \sim P(s_t,a_t,\cdot)$.
For a stationary policy $\pi$, the loss of the policy is 
$$L^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\ell_t(s_t,a_t)\right]$$ where $a_t\sim \pi(s_t,\cdot)$ and $s_{t+1} \sim P(s_t,a_t,\cdot)$.
The regret of the algorithm is defined as
$$R(\mathcal{A})=L(\mathcal{A})-\min_{\pi\in \Pi}L^{\pi}\ ,$$ 
where $\Pi$ denotes the comparator class of stationary policies.
The total expected loss of the best policy in hindsight is denoted by $L^*$. Thus,
$$L^*=\min_{\pi\in \Pi}L^{\pi} \ .$$

For any stationary policy $\pi$, let $T(s'\mid M,\pi,s)$ be the random variable for the first time step in which $s'$ is reached when we start at state $s$ and follow policy $\pi$ in MDP $M$.
We define the diameter $D(M)$ of the MDP as
$$D(M)=\max_{s\neq s'}\min_{\pi}\mathbb{E}\left[T(s'\mid M,\pi,s)\right].$$
A {\it communicating MDP} is an MDP where $D(M) < \infty$.
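
As a small illustration of this definition (a toy instance of our own, with time counted as the number of transitions taken), consider an MDP with states $1,\ldots,n$ and a single action that deterministically moves state $i$ to state $i+1$ for $i<n$ and state $n$ to state $1$. There is only one policy $\pi$, and hitting times are deterministic:
$$T(s'\mid M,\pi,s)=(s'-s) \bmod n \quad \text{for } s\neq s'.$$
The maximum over pairs is attained when $s'$ is the state immediately preceding $s$ on the cycle, so $D(M)=n-1<\infty$ and this MDP is communicating.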

\subsection{Preliminaries on ADMDPs}

In this section, we use the graph $G$ and the ADMDP interchangeably.
A stationary deterministic policy $\pi$ induces a subgraph $G_{\pi}$ of $G$ where $(s,a,s')$ is an edge in $G_{\pi}$ if and only if $\pi(s)=a$ and the action $a$ takes state $s$ to $s'$.

A communicating ADMDP corresponds to a strongly connected graph. This is because the existence of a policy that takes state $s$ to $s'$ also implies the existence of a path between the two vertices in the graph $G$. 

The subgraph $G_{\pi}$ induced by policy $\pi$ in the communicating ADMDP is the set of transitions $(s,a,s')$ that are possible under $\pi$. Each component of $G_\pi$ is either a cycle or an initial path followed by a cycle. To see this, start a walk from any state $s$ by following the policy $\pi$. Since the set of states is finite, eventually a state must be repeated, and this forms the cycle. 

Let $N(s,a)$ be the next state after visiting state $s$ and taking action $a$.
Define $I(s)$ as 
$$I(s)=\{(s',a)\mid N(s',a)=s\}.$$
The \textit{period} of a vertex $v$ in $G$ is the greatest common divisor of the lengths of all the cycles starting and ending at $v$. 
%give reference(its there in dekel, bremaud)
In a strongly connected graph, the period of each vertex can be proved to be equal (\cite{markov_chains} Chap. 2, Thm 4.2). Thus, the period of a strongly connected $G$ is well defined. If the period of $G$ is 1, we call $G$ \textit{aperiodic}.

Let $\mathcal{C}_{(s,k)}$ be the set of all closed walks of $G$ of length $k$ such that the start vertex is $s$. The elements of ${\mathcal{C}_{(s,k)}}$ are represented by the sequence of edges in the walks.

Note that the cycles induced by any stationary deterministic policy $\pi$ that are of length $k$ and contain the vertex $s$ will be in $\mathcal{C}_{(s,k)}$. However, $\mathcal{C}_{(s,k)}$ can also contain cycles not induced by policies (it can contain cycles that are not simple). We use $\mathcal{C}$ to denote $\bigcup_{s\in S,k\in [|S|]} \mathcal{C}_{(s,k)}$. We sometimes loosely refer to elements of $\mathcal{C}$ as cycles. For a cycle $c$, we define $a_t(c)$ to be the action taken by $c$ in the $t$-th step if we start following $c$ from the beginning of the interaction. Similarly, $s_t(c)$ is the state reached after following $c$ for $t-1$ steps from the start of the interaction. We define $k(c)$ as the length of the cycle $c$.


The vertices of a strongly connected graph $G$ with period $\gamma$ can be partitioned into $\gamma$ non-empty cycle classes $C_1,\ldots,C_{\gamma}$, where each edge goes from $C_{i}$ to $C_{i+1}$ (with indices taken modulo $\gamma$).
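
To make the notions of period and cycle classes concrete, consider the following toy graph (an illustrative example of our own, not one used in the analysis). Let $G$ have vertices $\{1,2,3,4\}$ and edges $1\to 2$, $2\to 3$, $3\to 4$, $4\to 1$ and $2\to 1$. Every closed walk through vertex $1$ is a concatenation of the cycles $1\to 2\to 1$ (length $2$) and $1\to 2\to 3\to 4\to 1$ (length $4$), so the period is $\gcd(2,4)=2$. The cycle classes are
$$C_1=\{1,3\},\qquad C_2=\{2,4\},$$
and each of the five edges goes from $C_1$ to $C_2$ or from $C_2$ to $C_1$. Adding the edge $3\to 1$ creates the closed walk $1\to 2\to 3\to 1$ of length $3$, so the period becomes $\gcd(2,3)=1$ and the graph is aperiodic.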

\begin{theorem}
\label{thm:critical_length}
If $G$ is strongly connected and aperiodic, there exists a critical length $d$ such that for any $\ell\geq d$, there exists a path of length $\ell$ in $G$ between any pair of vertices. Also, $d\leq n(n-1)$ where $n$ is the number of vertices in the graph. 
\end{theorem}
The above theorem is from \cite{10.2307/3689120}. It guarantees the existence of a $d>0$ such that there are paths of length $d$ between any pair of vertices. The following generalization from \cite{10.5555/3042817.3043012} extends the result to periodic graphs.
\begin{theorem}[\cite{10.5555/3042817.3043012}]
\label{thm:critical_length_gen}
If $G$ has a period $\gamma$, there exists a critical value $d$ such that for any integer $\ell\geq d$, there is a path in $G$ of length $\gamma\ell$ from any state $v$ to any other state in the same cycle class.
\end{theorem}
\begin{remark}
We can also find a path of length $\ell\geq d$ from a given vertex $s$ to any other vertex $s'$ efficiently. This can be done by constructing the path in the reverse direction. Interpreting $P$ as the adjacency matrix of $G$, we look at $P^{\ell-1}$ to find all the predecessors of $s'$ that can be reached from $s$ by a path of length $\ell-1$. We choose any of these as the penultimate vertex in the path and recurse.
\end{remark}
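
As an illustration of this reverse construction (a toy instance; the graph and the matrix name $B$ are our own), let $G$ have vertices $\{1,2,3\}$ and edges $1\to 2$, $2\to 1$, $2\to 3$, $3\to 1$, and let $B$ be its adjacency matrix, so that $(B^{m})_{uv}>0$ exactly when there is a walk of length $m$ from $u$ to $v$. Suppose we want a walk of length $\ell=4$ from $s=1$ to $s'=3$. The only predecessor of $3$ is $2$, and $(B^{3})_{12}>0$ because of the walk $1\to 2\to 1\to 2$, so $2$ is the penultimate vertex. Recursing, the only predecessor of $2$ is $1$ and $(B^{2})_{11}>0$; the predecessors of $1$ are $2$ and $3$, of which only $2$ satisfies $(B^{1})_{12}>0$. Reading the choices backwards gives
$$1\to 2\to 1\to 2\to 3.$$
Each step inspects entries of one matrix power, so the whole construction takes time polynomial in $|S|$ and $\ell$.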
\section{Deterministic Transitions}
We now present our algorithm for online learning in ADMDPs when we have full information of losses. We use $G$ to refer to the graph associated to the ADMDP.

We assume that the ADMDP dynamics are known to the agent. This assumption can be relaxed as shown in \citet{ORTNER20102684} as we can figure out the dynamics in poly($|S|,|A|$) time when the transitions are deterministic.  
%\ambuj{can sth similar be said for the stochastic case perhaps with an additional poly dependence on a diameter like quantity? also any such result will have to be approximate since transition probabilities can be estimated to arbitrary precision with finite data}
We want to minimize regret against the class of deterministic stationary policies. 

\subsection{Algorithm Sketch}
We formulate the task of minimizing regret against the set of deterministic policies as a problem of prediction with expert advice. As observed earlier, deterministic policies induce a subgraph which is isomorphic to a cycle with an initial path. We keep an expert for each element of $\mathcal{C}_{(s,k)}$ for all states $s$ and $k\leq |S|$. Note that we do not keep an expert for policies which have an initial path before the cycle. This is because the loss of such a policy differs by at most $|S|$ from the loss of its cycle. Also, we make sure that the start state of the cycle is in the same cycle class as the start state of the environment. If this is not the case, our algorithm will never be \textit{in phase} with the expert policy. Henceforth, we will refer to these experts as cycles.

The loss incurred by cycle $c\in \mathcal{C}_{(s,k)}$ at time $t$ is equal to $\ell_t(s_t,a_t)$ where $s_t$ and $a_t$ are the state action pair traversed by the cycle $c$ at time $t$ if we had followed it from the start of the interaction.

We first present an efficient (running time polynomial in $|S|,|A|$ and $T$) algorithm to achieve $O(\sqrt{T})$ regret and switching cost against this class of experts. For this, we use a \textit{Follow the perturbed leader} (FPL) style algorithm. 

We then use this low-switching algorithm as a black box. Whenever the black box algorithm tells us to switch policies at time $t$, we compute the state $s$ that we would have reached if we had followed the new policy from the start of the interaction and moved $t+\gamma d$ steps. We then move to this state $s$ in $\gamma d$ steps. Theorem~\ref{thm:critical_length_gen} guarantees the existence of a path of this length. We then start following the new policy.

Thus, our algorithm matches the moves of the expert policies except when there is a switch in the policies. Thus, the regret of our algorithm differs from the regret of the black box algorithm by at most $O(\gamma d\sqrt{T})$.

\subsection{FPL algorithm}
We now describe the FPL style algorithm that competes with the set of cycles described earlier with $O(\sqrt{T})$ regret and switching cost.

\begin{algorithm}[]

    % Set Function Names
        Sample perturbation vectors $\epsilon_i\in \mathbb{R}^{|S||A|}$ for $1\leq i\leq |S|$ from an exponential distribution with parameter $\lambda$\;
        Sample a perturbation vector $\delta\in \mathbb{R}^{|S|^2}$ from the same distribution\;
        $t=1$\;
        \While{$t\neq T+1$}
        { 
            $C_t=\argmin_{s\in S,k\in [|S|],c\in \mathcal{C}_{(s,k)}} \delta(s,k)+\sum_{i=1}^{t-1}\ell_i(s_i(c),a_i(c))+\sum_{i=1}^{k(c)}\epsilon_i(s_i(c),a_i(c))$\;
        
            Adversary returns loss function $\ell_t$\;
            $t=t+1$\;
           }
      \caption{FPL algorithm for Deterministic MDPs}
    \label{alg:fpl}
    \end{algorithm}
\subsubsection{Finding the leader: Offline Optimization Algorithm}
First, we design an offline algorithm that finds the cycle (including start state) with the lowest cumulative loss up to time $t$ given the sequence of losses $\ell_1,\ldots,\ell_{t-1}$. This is the $\argmin$ step in Algorithm~\ref{alg:fpl}. Given $(s,k)$, we find the best cycle among the cycles that start in state $s$ and have length $k$. For this, we use a method similar to that used in \citet{10.5555/3020652.3020666}.
We then take the minimum over all $(s,k)$ pairs to find the best cycle. Note that we only consider start states $s$ which are in the same cycle class as the start state $s_1$ of the game.

We find the best cycle in $\mathcal{C}_{(s,k)}$ using Linear Programming.
Let $n=|S||A|k$. The LP is in the space $\mathbb{R}^n$.
Consider a cycle $c\in \mathcal{C}_{(s,k)}$. Let $c_i$ denote the $i$-th state in $c$. Also, let $a_i$ be the action taken at that state.
We associate a vector $x(c)$ with the cycle as follows.
$$x(c)_{s,a,i}=\begin{cases}
    1 & \text{if } a=a_i \text{ and } s=c_i\\
    0 & \text{otherwise}
\end{cases}$$

We construct a loss vector in $\mathbb{R}^{n}$ as follows. 
$$l_{s,a,i}=\sum_{\substack{1\leq j<t\\ j\equiv i \pmod{k}}}\ell_{j}(s,a)$$
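
For instance (a worked instance under the definitions above), take $k=2$ and $t=5$, so that the losses $\ell_1,\ldots,\ell_4$ have been observed. Then
$$l_{s,a,1}=\ell_1(s,a)+\ell_3(s,a),\qquad l_{s,a,2}=\ell_2(s,a)+\ell_4(s,a),$$
and for a cycle $c$ of length $2$ that visits $(c_1,a_1)$ and then $(c_2,a_2)$,
$$\langle x(c),l\rangle=\ell_1(c_1,a_1)+\ell_2(c_2,a_2)+\ell_3(c_1,a_1)+\ell_4(c_2,a_2),$$
which is exactly the cumulative loss of following $c$ from the start of the interaction for the first $t-1=4$ steps.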

Our decision set $\mathcal{X}\subseteq \mathbb{R}^n$ is the convex hull of all $x(c)$ where $c\in \mathcal{C}_{(s,k)}$. Our objective is to find $x$ in $\mathcal{X}$ such that $\langle x,l\rangle$ is minimized. The set $\mathcal{X}$ can be captured by the following polynomial sized set of linear constraints.
\begin{align*}
    &x\geq 0\\ &\sum_{a\in A}x_{(s,a,1)}=1\\
    &\forall s'\in S\setminus\{s\},a\in A,\;x_{(s',a,1)}=0\\
    &\forall (s',a')\notin I(s),\;   x_{(s',a',k)}=0\\
    &\forall u\in S,2\leq i\leq k,\; \sum_{(s',a')\in I(u)}x_{(s',a',i-1)}=\sum_{a\in A}x_{(u,a,i)}
\end{align*}

Once we get an optimal $x$ for the above LP, we can decompose the mixed solution as a convex combination of at most $n+1$ cycles by Carath\'eodory's theorem. Also, these cycles can be recovered efficiently (\citet{10.5555/1062374}). Each of them will have the same loss and hence we can choose any of them.

Once we have an optimal cycle for a given $(s,k)$, we can minimize over all such pairs to get the optimal cycle. This gives us a polynomial time algorithm to get the optimal cycle.

\begin{remark}
If the new cycle chosen has the same perturbed loss as the old cycle, we will not switch. This is to prevent any unnecessary switches caused by the arbitrary choice of cycle in each optimization step (as we choose an arbitrary cycle with non-zero weight in the solution).
\end{remark}
%\begin{remark}
%Note that $\mathcal{C}_{(s,k)}$ can also contain cycles that don't correspond to deterministic stationary policies. However, arguing a regret upper bound against this larger class is sufficient to prove regret bounds against the class of stationary deterministic policies.
%  \end{remark}

%\subsubsection{Perturbing the losses}


%We sample $|S|$ perturbed loss vectors $\epsilon_i\in \mathbb{R}^{|S||A|}$ for $1\leq i\leq |S|$ by sampling each component independently from an exponential distribution with parameter $\lambda$. We also sample a perturbation vector $\delta\in \mathbb{R}^{|S|^2}$ from the same distribution.  Let the received losses be $\ell_1,\ldots,\ell_T$.


%We now describe the perturbed losses used to find the best cycle in $\mathcal{C}_{(s,k)}$. We set $\ell'_i=\ell_i-\epsilon_i$ for $1<i\leq k$. For $i=1$, we set $\ell' _1(s',a)=\ell_1(s',a)-\epsilon_1(s',a)-\delta(s,k)$. If $k<T$, we set $\ell'_i=\ell_i$ for $i>k$. 

%In the above definition, if $T<k$, we assume $\ell_j=0$ for $j>T$.
%We now solve the optimization problems with these new loss functions $\ell'$. Also, the time $t$ in the optimization will be $\max(T,k+1)$.


\subsubsection{Regret of the FPL algorithm}
\label{sec:fpl_analysis}

We now state the bound on the regret and expected number of switches of Algorithm~\ref{alg:fpl}. 
\begin{theorem}
  \label{first-order-theorem}
  For appropriately chosen $\lambda$, the regret and the expected number of switches of Algorithm~\ref{alg:fpl} can be bounded by $$O\left(|S|\sqrt{L^*\cdot \log |S||A|}\right)$$ where $L^*$ is the cumulative loss of the best cycle in hindsight.
  \end{theorem}

To achieve the desired switching bound, we group the policies into a polynomial number of groups and show that the probability of the current policy switching to a policy in any one of these groups is at most 
$\lambda$. Then, taking a union bound over all the groups, we obtain the desired bound. We prove this result in the supplementary material.

\subsection{Putting it together}
We have described the FPL style algorithm
that achieves low regret and low switching. We now use Algorithm~\ref{alg:fpl} as a sub-routine to design a low regret algorithm for the online ADMDP problem. 

Recall that for a cycle $c$, $s_t(c)$ is the state you would reach if you followed the cycle $c$ from the start. This can be computed efficiently.
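
Concretely, writing $c_i$ and $a_i$ for the $i$-th state and action of $c$ (with $1\leq i\leq k(c)$, as in the LP construction above), one way to express this computation is
$$s_t(c)=c_{1+((t-1)\bmod k(c))},\qquad a_t(c)=a_{1+((t-1)\bmod k(c))}.$$
For example, if $k(c)=3$ then $s_8(c)=c_{1+(7\bmod 3)}=c_2$, since the cycle visits $c_1,c_2,c_3,c_1,c_2,c_3,c_1,c_2$ in its first eight steps. Evaluating either expression takes a single modular reduction.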


\begin{algorithm}[]

% Set Function Names
  $t=1$\;
    $s_1$ is the start state of the environment\;
     Let $c_1$ be the cycle chosen by Algorithm~\ref{alg:fpl} at $t=1$\;
    \If{$s_1\neq s_1(c_1)$}
    {
        Spend $\gamma d$ steps to move to state $s_{1+\gamma d}(c_1)$\;
        $c_{1+\gamma d}=c_1$\;
        $t=1+\gamma d$\;
    }
    
    \While{$t\neq T+1$}
    { 
       Choose action $a_t=a_t(c_t)$\;
        Adversary returns loss function $\ell_t$ and next state $s_{t+1}$\;
       Feed $\ell_t$ as the loss to Algorithm~\ref{alg:fpl} \;
         \If{Algorithm~\ref{alg:fpl} {switches cycle to } $c_{t+1}$}
        {
        \If{$s_{t+1}\neq s_{t+1}(c_{t+1})$}
    {
        Spend $\gamma d$ steps to move to state $s_{t+\gamma d}(c_{t+1})$\;
        $c_{t+\gamma d}=c_{t+1}$\;
        $t=t+\gamma d$\;

    }
    \Else
    {
        $t=t+1$\;
    }
           
        }
        \Else
        {
            
            $t=t+1$\;
        }
      }
  \caption{Low regret algorithm for communicating ADMDPs}
\label{alg:admdp}
\end{algorithm}

We now state the regret bound of Algorithm~\ref{alg:admdp}.
\begin{theorem}
  \label{first-order regret}
  Given a communicating ADMDP with state space $S$, action space $A$ and period $\gamma$, the regret of Algorithm~\ref{alg:admdp} is bounded by
  $$\text{Regret}\leq O\left(|S|^3\cdot \gamma\sqrt{L^*\cdot\log{|S||A|}}\right) $$ where $L^*$ is the total loss incurred by the best stationary deterministic policy in hindsight.
  \end{theorem}
  \begin{proof}
We spend $\gamma d$ steps whenever Algorithm~\ref{alg:fpl} switches. In all other steps, we receive the same loss as the cycle chosen by Algorithm~\ref{alg:fpl}. Thus, the regret differs by at most $\gamma d\cdot N_s$, where $N_s$ is the number of switches made by Algorithm~\ref{alg:fpl}. From Theorem~\ref{first-order-theorem},
we get that the total regret of our algorithm in the deterministic case is $O\left(|S|\cdot \gamma d\sqrt{L^*\cdot\log{|S||A|}}\right)$ where $d$ is the critical length in the ADMDP. Note that $d$ is at most $O(|S|^2)$. Thus, we get that $$\text{Regret}\leq O\left(|S|^3\cdot \gamma\sqrt{L^*\cdot\log{|S||A|}}\right). $$ 
  \end{proof}



\begin{remark}
To achieve the first order regret bound, we set $\lambda$ in terms of $L^*$. We need prior knowledge of $L^*$ to directly do this. This can be circumvented by using a doubling trick.
\end{remark}
\subsection{Regret Lower Bound for Deterministic MDPs}
We now state a matching regret lower bound (up to polynomial factors). 
\begin{theorem}
  \label{thm:lower_bound}
For any algorithm $\mathcal{A}$ and any $|S|>3,|A|\geq1$, there exists an MDP $M$ with $|S|$ states and $|A|$ actions and a sequence of losses $\ell_1,\ldots, \ell_T$ such that $$R(\mathcal{A})\geq \Omega\left(\sqrt{|S|T\log |A|}\right)$$ where $R(\mathcal{A})$ is the regret incurred by $\mathcal{A}$ on $M$ with the given sequence of losses.
\end{theorem}

\section{Stochastic Transitions}

In the previous sections, we only considered deterministic transitions. We now present an algorithm that achieves low regret for the more general class of communicating MDPs (with an additional mild restriction). This algorithm achieves $O(\sqrt{T})$ regret but takes exponential time to run (exponential in $|S|$).


\begin{assumption}
\label{asmptn:loop}
The MDP $M$ has a state $s^*$ and action $a$ such that $$Pr(s_{t+1}=s^*\mid s_t=s^*,a_t=a)=1$$
\end{assumption}
In other words, there is some state $s^*$ in which we have a deterministic action that allows us to stay in the state $s^*$. This can be interpreted as a state with a ``do nothing'' action where we can wait before taking the next action.

We now state a theorem that guarantees the existence of a number $\ell^*$ such that all states can be reached from $s^*$ in exactly $\ell^*$ steps with a reasonably high probability.
\begin{theorem}
  \label{clry:critical_length}
  In MDPs satisfying Assumption~\ref{asmptn:loop}, there exist $\ell^*\leq 2D$ and a state $s^*$ such that, for every target state $s'$, there is a policy $\pi_{s'}$ such that 
  $$p_{s'}=Pr[T(s'\mid M,\pi_{s'},s^*)=\ell^*]\geq \frac{1}{4D}$$
  \end{theorem}
 

%\begin{remark}
%The policies guaranteed by Theorem~\ref{clry:critical_length} are not stationary.
%\end{remark}
Let $p^*=\min_{s}p_s$. By Theorem~\ref{clry:critical_length}, $p^*\geq \frac{1}{4D}$.
\subsection{Algorithm}
We extend the algorithm we used in the deterministic MDP case. 

We use a low-switching algorithm (FPL) that considers each policy $\pi\in \Pi$ as an expert. We know from \citet{KALAI2005291} that, with $n$ experts, FPL achieves $O(\sqrt{T\log{n}})$ regret as well as $O(\sqrt{T\log{n}})$ switching cost. At time $t$, we receive loss function $\ell_t$ from the adversary. Using this, we construct $\hat{\ell}_t$ as
$$\hat{\ell}_t(\pi)=\mathbb{E}\left[\ell_t(s_t,a_t)\right]$$
where $s_1\sim d_1$ and $a_t\sim \pi(s_t,.)$. Here, $d_1$ is the initial distribution over states.

In other words, $\hat{\ell}_t(\pi)$ is the expected loss at time $t$ if we follow the policy $\pi$ from the start of the game.

We feed $\hat{\ell}_t$ as the losses to FPL. 

We can now rewrite  $L^{\pi}$ as 
$$L^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\ell_t(s_t,a_t)\right]=\sum_{t=1}^{T}\mathbb{E}\left[\ell_t(s_t,a_t)\right]=\sum_{t=1}^{T}\hat{\ell}_t(\pi)$$ where $s_1\sim d_1$ and $a_t\sim \pi(s_t,.)$. 
Let $\pi_t$ be the policy chosen by FPL at time $t$.
We know that
$$\mathbb{E}\left[\sum_{t=1}^{T}\hat{\ell}_t(\pi_t)\right]-\sum_{t=1}^{T}\hat{\ell}_t(\pi)\leq O(\sqrt{T\log |\Pi|})$$ for any deterministic policy $\pi$.

We need our algorithm to receive loss close to the first term on the left-hand side above. If this is possible, we have an $O(\sqrt{T})$ regret bound for online learning in the MDP. We now present an approach to do this. 
\subsubsection{Catching a policy}
When FPL switches policy, we cannot immediately start receiving the losses of the new policy. If this were possible, the regret of our algorithm would match that of FPL. When implementing the policy switch in our algorithm, we suffer a delay before starting to incur the losses of the new policy (in an expected sense). Our goal now is to make this delay as small as possible. This, coupled with the fact that FPL makes few switches, will give us good regret bounds. Note that this was easily done in the deterministic case using Theorem~\ref{thm:critical_length_gen}. Theorem~\ref{clry:critical_length} acts as a stochastic analogue of Theorem~\ref{thm:critical_length_gen}, and we use it to reduce the time taken to catch the policy.


\begin{algorithm}

% Set Function Names
  \SetKwFunction{FMain}{Main}
  \SetKwFunction{FSwitch}{Switch\_Policy}
 \DontPrintSemicolon

  \SetKwProg{Fn}{Function}{:}{}
  \Fn{\FSwitch{$s$,$\pi$,$t_0$}}{
        $Done$ = 0\;
        $t=t_0+1$\tcp*{$t_0$ is the time that $B$ switched policy}
        %Sample $T_0,T_1,\ldots,T_{t-1}$ from the Markov Chain induced by target policy $\pi$\;
        $S_{t}=s$\tcp*{$S_{t}$ stores the state at time $t$}
        \While{$Done\neq 1$}
        {
            Move to state $s^{*}$ using the best policy \tcp*{Say this step takes $k$ steps}
           
      
            $t=t+k$\;
             Sample $T_{t+\ell^*}$ from $d_{\pi}^{t+\ell^*}(.)$\;
        
            We set $T_{t+\ell^*}$ as the target state\;
            Use policy $\pi_{T_{t+\ell^*}}$ guaranteed by Theorem~\ref{clry:critical_length} to move $\ell^*$ steps from $s^*$\;
            $t=t+\ell^*$\;
            \If{$S_t=T_t$}
            {
                Draw a Bernoulli random variable $I$ such that $I=1$ with probability $\frac{p^*}{p_{S_t}}$\;
                \If{$I=1$}{
                Start following $\pi$ and set $Done$ to $1$\;
                Let the time at which this happens be $T_{switch}$\;}
                \Else{
                Continue\;}
            }
            \Else{
            Continue\;
            }
            
            
        }
  }
  \;


  \SetKwProg{Fn}{Function}{:}{\KwRet}
  \Fn{\FMain}{
      Let $\pi_1^\text{FPL}$  be the expert chosen by FPL at time $1$\;
      $\pi_1=\pi_1^\text{FPL}$\;
      Let $S_1$ be the start state.\;
      $t=1$\;
      \While{$t\neq T+1$}
      {
        Sample $a_t$ from $\pi_t(s_t,.)$\;
        Adversary returns loss function $\ell_t$ and next state $s$\;
        $S_{t+1}=s$\;
        Compute $\hat{\ell}_t$ and feed it as the loss to FPL as discussed before\;
        \If{{FPL switches policy}}
        {
            Switch\_Policy($s,\pi_{t+1}^\text{FPL},t+1$)\tcp*{Call the switch policy function to catch the new policy}
            $\pi_{t+k}=\pi_{t+1}^\text{FPL}$\tcp*{$k$ is the number of steps taken by Switch\_Policy}
            $t=t+k$\;
           
        }
        \Else
        {
            $\pi_{t+1}=\pi_t$\;
            $t=t+1$\;
        }
      }
  }
  
\caption{Low Regret Algorithm For Communicating MDPs}
\label{alg:alpha}
\end{algorithm}
%\begin{remark}
%In Algorithm~\ref{alg:alpha}, if FPL switches the policy in the middle of the Switch\_Policy's execution, we terminate the execution and call the routine again with a new target policy.
%\end{remark}
\subsection{Analysis}
The following lemma shows that the Switch\_Policy routine works correctly. That is, after the execution of the routine, the state distribution is exactly the same as the state distribution of the new policy. 
\begin{lemma}
\label{lem:switch_dist}
If Switch\_Policy terminates at time $t$, we have that 
$$Pr[S_t=s\mid T_{switch}=t]=d_{\pi}^{t}(s)$$
where $d_{\pi}^{t}(s)$ is the distribution of states after following policy $\pi$ from the start of the game.
\end{lemma}
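
The role of the Bernoulli variable $I$ in Switch\_Policy can be seen from the following short calculation. It is meant as an informal sketch, not as a replacement for the proof of Lemma~\ref{lem:switch_dist}. Fix one iteration of the while loop that ends at time $t$, and let $\mathcal{E}$ denote the event that the routine terminates in this iteration, that is, $S_t=T_t$ and $I=1$. We assume, as a simplification introduced here, that the target $T_t$ is drawn from $d_{\pi}^{t}$ and that a target $s$ is reached after exactly $\ell^*$ steps with probability exactly $p_s$. Then, for every state $s$,
\begin{align*}
Pr[S_t=s,\ \mathcal{E}] &= d_{\pi}^{t}(s)\cdot p_s\cdot \frac{p^*}{p_s} = p^*\cdot d_{\pi}^{t}(s),\\
Pr[\mathcal{E}] &= \sum_{s} p^*\cdot d_{\pi}^{t}(s) = p^*,\\
Pr[S_t=s\mid \mathcal{E}] &= \frac{p^*\cdot d_{\pi}^{t}(s)}{p^*} = d_{\pi}^{t}(s).
\end{align*}
The acceptance probability $\frac{p^*}{p_{S_t}}$ thus cancels the state-dependent success probability $p_s$, in the style of rejection sampling. Under the same assumptions, each iteration terminates with probability $p^*\geq \frac{1}{4D}$, so the number of iterations until termination is geometric with mean $1/p^*\leq 4D$.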


We now bound the expected loss of the algorithm in the period during which FPL chooses policy $\pi$.
\begin{lemma}
\label{lem:switch_cost}
Let the policy of FPL be $\pi$ from time $t_1$ to $t_2$. We have that 
$$\mathbb{E}\left[\sum_{t=t_1}^{t_2}\ell_t(s_t,a_t)\right]\leq 48\cdot D^2+\sum_{t=t_1}^{t_2}\hat{\ell}_t(\pi)$$
\end{lemma}

We are now ready to bound the regret of Algorithm~\ref{alg:alpha}.
\begin{theorem}
  \label{thm:communicating_regret}
The regret of Algorithm~\ref{alg:alpha} is at most $O\left(D^2\sqrt{T\log|\Pi|}\right)$.
\end{theorem}
\begin{proof}
We condition on the number of switches made by FPL. Let $N_s$ be the random variable corresponding to the number of switches made by FPL. We refer to Algorithm~\ref{alg:alpha} as $\mathcal{A}$.
\begin{align*}
    L(\mathcal{A})&=\mathbb{E}\left[\sum_{t=1}^{T}\ell_t(s_t,a_t)\right]\\
    &=\mathbb{E}\left[\mathbb{E}\left[\sum_{t=1}^{T}\ell_t(s_t,a_t)\mid N_s\right]\right]
\end{align*}

By Lemma~\ref{lem:switch_cost}, after each switch the algorithm suffers at most $48\cdot D^2$ extra expected loss compared to the loss of FPL. Thus, 
$$L(\mathcal{A})\leq \mathbb{E}\left[48\cdot D^2\cdot N_s+\sum_{t=1}^{T}\hat{\ell}_t(\pi_t)\right]$$
Here, $\pi_t$ is the policy chosen by FPL at time $t$.
Since FPL is a low-switching algorithm, we have $\mathbb{E}[N_s]\leq O(\sqrt{T\log |\Pi|})$. Since FPL is also a low-regret algorithm, the expectation of the second term is at most $L^\pi+O(\sqrt{T\log|\Pi|})$ for any deterministic policy $\pi$.
Thus, we have $$L(\mathcal{A})-L^\pi\leq O(D^2\sqrt{T\log|\Pi|})$$ for all stationary $\pi$.

Thus, $R(\mathcal{A})\leq O\left(D^2\sqrt{T\log|\Pi|}\right)$.
\end{proof}
When $\Pi$ is the set of stationary deterministic policies, we get that $|\Pi|\leq |A|^{|S|}$. Thus, we get the following theorem.
\begin{theorem}
Given a communicating MDP satisfying Assumption~\ref{asmptn:loop} with $|S|$ states, $|A|$ actions and diameter $D$, the regret of Algorithm~\ref{alg:alpha} can be bounded by
$$\text{Regret}\leq O\left(D^2\sqrt{T|S|\log |A|}\right)$$ 
\end{theorem}
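
For concreteness, the substitution behind this bound can be written out as follows. It uses only $|\Pi|\leq |A|^{|S|}$ and Theorem~\ref{thm:communicating_regret}:
\begin{align*}
\log|\Pi| &\leq \log\left(|A|^{|S|}\right) = |S|\log|A|,\\
D^2\sqrt{T\log|\Pi|} &\leq D^2\sqrt{T|S|\log|A|}.
\end{align*}
As a purely illustrative instance, with numbers chosen here and not taken from the analysis, $|S|=10$ and $|A|=4$ give at most $4^{10}=1048576$ stationary deterministic policies, while $\log_2|\Pi|\leq 20$. The regret bound therefore grows only polynomially in $|S|$, even though running FPL with one expert per policy takes time exponential in $|S|$.
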
In fact, since we are using FPL as the expert algorithm, we can get first-order bounds similar to Theorem~\ref{first-order-theorem}. In a setting with $n$ experts where $m$ is the total loss of the best expert, both the regret and the number of switches can be bounded by $O(\sqrt{m\cdot \log n})$. Using this, we get the following first-order regret bound for Algorithm~\ref{alg:alpha}.
\begin{corollary}
Given a communicating MDP satisfying Assumption~\ref{asmptn:loop} with $|S|$ states, $|A|$ actions and diameter $D$, the regret of Algorithm~\ref{alg:alpha} can be bounded by
$$\text{Regret}\leq O\left(D^2\sqrt{L^*\cdot |S|\log |A|}\right)$$ where $L^*$ is the total expected loss incurred by the best stationary deterministic policy in hindsight.
\end{corollary}
\subsection{Oracle-efficient algorithm assuming exploring starts}
In this section, we design an oracle-efficient algorithm for the stochastic case assuming that the initial distribution over states, $d_1$, has probability mass at least $\alpha$ on every state. That is, $Pr[S_1=s]\geq\alpha$ for all $s\in S$. 

Here, we assume that we have an oracle $\mathcal{O}$ that can find the stationary deterministic policy with minimum cumulative loss at no computational cost. We use this oracle and the ideas from the previous sections to design an FPL-style low-regret algorithm.

Before proceeding to a detailed discussion of the oracle and the algorithm, we comment on the motivation for designing such algorithms. Oracle-efficient reductions are standard for online problems in which the offline problem is computationally challenging. The benefit is in showing that handling the online case does not add any complexity (up to polynomial factors) over solving the offline problem (where there is no learning, just optimization). In our case, it is not clear if the offline problem (calculating the best stationary policy in hindsight) can be solved efficiently. Theorem 4.12 from \cite{complexity_allender} states that the decision version of the problem is $\mathsf{P}$-hard and in $\mathsf{NP}$, but nothing further is known about the hardness to the best of our knowledge. Any future improvement in the offline problem immediately improves our online algorithm as well, and this is the main advantage of designing an oracle-efficient algorithm.



\subsubsection{Oracle}
The oracle $\mathcal{O}$ takes in loss functions $\ell_1,\ldots,\ell_T$ and outputs the stationary deterministic policy with the lowest expected cumulative loss. That is, it returns the policy $\pi=\argmin_{\Pi}L^{\pi}$, where $$L^{\pi}=\mathbb{E}\left[\sum_{t=1}^{T}\ell_t(s_t,a_t)\right]$$ with $s_1\sim d_1$ and $a_t=\pi(s_t)$.

We say that an algorithm is \textit{oracle-efficient} if it runs in polynomial time when given access to the oracle $\mathcal{O}$. We now describe our oracle-efficient algorithm.
\subsubsection{Algorithm}
 
\begin{algorithm}[]

  % Set Function Names
      Sample a perturbation vector $\epsilon\in \mathbb{R}^{|S||A|}$ with entries drawn from an exponential distribution with parameter $\lambda$\;
      $t=1$\;
      \While{$t\neq T+1$}
      { 
          $$\pi_t=\argmin_{\pi\in \Pi} \mathbb{E}\left[\epsilon(s_1,a_1)+\sum_{i=1}^{t-1}\ell_i(s_i,a_i)\right]$$ with $s_1\sim d_1$ and $a_i=\pi(s_i)$\;
      
          Adversary returns loss function $\ell_t$\;
          $t=t+1$\;
         }
    \caption{Black Box FPL algorithm used for Communicating MDPs with exploring starts}
  \label{alg:fpl_communicating}
  \end{algorithm}

We use Algorithm~{\ref{alg:fpl_communicating}} as our black box experts algorithm. We prove that Algorithm~{\ref{alg:fpl_communicating}} has low regret.
\begin{theorem} \label{thm:fpl_communicating}
  The regret and the expected number of switches of Algorithm~{\ref{alg:fpl_communicating}} can each be bounded by $ O\left(\sqrt{\frac{L^*|S|\log |S||A|}{\alpha}}\right)$.
\end{theorem}
The rest of the algorithm is the same as Algorithm~{\ref{alg:alpha}}. The exploring starts assumption allows us to get an efficient low-regret, low-switching algorithm assuming access to the oracle $\mathcal{O}$:
\begin{theorem}\label{thm:communicating_uniform_start}
Given a communicating MDP satisfying Assumption~{\ref{asmptn:loop}} with a start distribution with at least probability $\alpha$ on every state, and given access to the oracle $\mathcal{O}$, we have an efficient algorithm with 
$$\text{Regret}\leq O\left(D^2\sqrt{\frac{L^*|S|\log |S||A|}{\alpha}}\right).$$
\end{theorem}
\begin{proof}
  The proof is exactly the same as that of Theorem~{\ref{thm:communicating_regret}} except that we use the switching cost bound from Theorem~{\ref{thm:fpl_communicating}}.
\end{proof}
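
Spelled out as a sketch, the substitution in this proof reads as follows. We write $B=C\sqrt{\frac{L^*|S|\log |S||A|}{\alpha}}$, where $C$ is a hypothetical constant standing in for the one hidden in the $O(\cdot)$ of Theorem~{\ref{thm:fpl_communicating}}. Both $B$ and $C$ are notation introduced only for this illustration, and we use $D\geq 1$. Following the argument of Theorem~{\ref{thm:communicating_regret}}, for any stationary deterministic policy $\pi$,
\begin{align*}
L(\mathcal{A})-L^{\pi} &\leq 48\cdot D^2\cdot \mathbb{E}[N_s]+\left(\mathbb{E}\left[\sum_{t=1}^{T}\hat{\ell}_t(\pi_t)\right]-L^{\pi}\right)\\
&\leq 48\cdot D^2\cdot B + B \leq 49\cdot D^2\cdot B.
\end{align*}
The first term is the cost of catching the new policy after each switch, and the second term is the regret of the black box experts algorithm.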


\section{Conclusion}
We considered learning in a communicating MDP with adversarially chosen costs in the full information setting. We gave an efficient algorithm that achieves $O(\sqrt{T})$ regret when transitions are deterministic. We also presented an inefficient algorithm that achieves an $O(\sqrt{T})$ regret bound for the general stochastic case under an additional mild assumption. Our results show that in the full information setting there is \emph{no statistical price} (as far as the time dependence is concerned) for the extension from the vanilla online learning with experts problem to the problem of online learning with communicating MDPs.

Several interesting questions still remain open. First, what are the best lower bounds in the general (i.e., not necessarily deterministic) communicating setting? In the deterministic setting, the diameter is bounded polynomially by the state space size. This is no longer true in the stochastic case. The best lower bound in terms of the diameter and other relevant quantities ($|S|,|A|$ and $T$) still remains to be worked out. Second, is it possible to design an efficient algorithm beyond the deterministic case with fewer assumptions? The source of inefficiency in our algorithm is that we run FPL with each policy as an expert and perturb the losses of each policy independently. It is plausible that an FPL algorithm that perturbs losses (as in the deterministic case) can also be analyzed. However, there are challenges in its analysis as well as in proving that it is computationally efficient. For example, we are not aware of any efficient way to compute the best deterministic policy in hindsight for the general communicating case. This leads us to another open question: are there any oracle-efficient $O(\sqrt{T})$ regret algorithms for online learning over communicating MDPs? \citet{ftpl_mdp_bandit} give an oracle-efficient $O(T^{5/6})$ regret algorithm, but it is designed to work in the bandit setting as well and does not exploit the additional information available in the full information case.

\paragraph{Acknowledgements.} AT acknowledges the support of NSF via grant IIS-2007055. Thanks to Csaba Szepesvari and Brian Denton for helpful discussions about the computational complexity of finding the best stationary policy in hindsight. 

% References
\bibliography{chandrasekaran_147}
\end{document}
