\section{Introduction}
\label{sec:intro}
The Stochastic Shortest Path (SSP) model considers the problem of an agent interacting with an environment to reach a predefined goal state while minimizing the cumulative expected cost. Unlike in finite-horizon and discounted Markov Decision Processes (MDPs), in the SSP model the horizon of interaction between the agent and the environment depends on the agent's actions and can be unbounded (if the goal is never reached). A wide variety of goal-oriented control and reinforcement learning (RL) problems, such as navigation and game playing, can be formulated as SSP problems. In the RL setting, where the SSP model is unknown, the agent interacts with the environment over $K$ \textit{episodes}. Each episode begins at a predefined initial state and ends when the agent reaches the goal (note that it might never reach the goal). We consider the setting where the state and action spaces are finite, the cost function is known, but the transition kernel is unknown. The performance of the agent is measured through the notion of \textit{regret}, i.e., the difference between the cumulative cost of the learning algorithm and that of the optimal policy over the $K$ episodes.

The agent has to balance the well-known trade-off between \textit{exploration} and \textit{exploitation}: should the agent \textit{explore} the environment to gain information for future decisions, or should it \textit{exploit} the current information to minimize the cost? A general way to balance the exploration-exploitation trade-off is to use the \textit{Optimism in the Face of Uncertainty} (OFU) principle \citep{lai1985asymptotically}. The idea is to construct a set of plausible models based on the available information, select the model associated with the minimum cost, and follow the optimal policy with respect to the selected model. This idea is widely used in the RL literature for MDPs (e.g., \citep{jaksch2010near,azar2017minimax,fruit2018efficient,jin2018q,wei2020model,wei2021learning}) and also for SSP models \citep{tarbouriech2020no,rosenberg2020near,rosenberg2020stochastic,chen2021finding,tarbouriech2021stochastic}. 

An alternative fundamental idea to encourage exploration is to use Posterior Sampling (PS) (also known as Thompson Sampling) \citep{thompson1933likelihood}. The idea is to maintain a posterior distribution over the unknown model parameters based on the available information and the prior distribution. PS algorithms usually proceed in \textit{epochs}. At the beginning of each epoch, a model is sampled from the posterior. The actions during the epoch are then selected according to the optimal policy associated with the sampled model. PS algorithms have two main advantages over OFU-type algorithms. First, prior knowledge of the environment can be incorporated through the prior distribution. Second, PS algorithms have shown superior numerical performance on multi-armed bandit problems \citep{scott2010modern,chapelle2011empirical} and MDPs \citep{osband2013more,osband2017posterior,ouyang2017learning}.
%\textcolor{red}{In fact, it can be argued easily that a mis-specified prior distribution will only affect the regret as a constant factor.}

The main difficulty in designing PS algorithms is the design of the epochs. In the basic setting of bandit problems, one can simply sample at every time step \citep{chapelle2011empirical}. In finite-horizon MDPs, where the length of an episode is predetermined and fixed, the epochs and episodes coincide, i.e., the agent can sample from the posterior distribution at the beginning of each episode \citep{osband2013more}. Moreover, a bad policy in an episode of a finite-horizon MDP only results in constant regret. However, in the general SSP model, where the length of each episode is not predetermined and can possibly be unbounded, these natural choices for the epoch do not work. This is because sticking to a bad policy in any of the episodes prevents the agent from reaching the goal and imposes infinite regret. Indeed, the agent needs to switch policies during an episode if the current policy cannot reach the goal.

In this paper, we propose \ssp, the first PS-based RL algorithm for the SSP model. \ssp~starts a new epoch based on two criteria. According to the first criterion, a new epoch starts if the number of episodes within the current epoch exceeds that of the previous epoch. The second criterion is triggered when the number of visits to any state-action pair is doubled during an epoch. %, similar to the one used by \cite{bartlett2009regal,jaksch2010near,filippi2010optimism,dann2015sample,ouyang2017learning,rosenberg2020near}.
Intuitively speaking, in the early stages of the interaction between the agent and the environment, the second criterion triggers more often. This criterion is responsible for switching policies during an episode if the current policy cannot reach the goal. In the later stages of the interaction, the first criterion triggers more often and  encourages exploration. We prove a Bayesian regret bound of $\otil(\B S\sqrt{AK})$, where $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\B$ is an upper bound on the expected cost of the optimal policy. 
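
\noindent As an informal illustration of the two criteria, and not the formal definition, one possible way to write them is sketched below. The notation is assumed here purely for illustration: $t_\ell$ is the time step at which epoch $\ell$ starts, $K_{\ell-1}$ is the number of episodes contained in the previous epoch, $k_\ell(t)$ is the number of episodes counted so far in epoch $\ell$ up to time $t$, and $n_t(s,a)$ is the number of visits to the state-action pair $(s,a)$ before time $t$. Under these assumptions, epoch $\ell$ ends at the first time $t > t_\ell$ at which
\[
k_\ell(t) > K_{\ell-1}
\qquad \mbox{or} \qquad
n_t(s,a) > 2\, n_{t_\ell}(s,a) \ \ \mbox{for some pair } (s,a).
\]
Whether the inequalities are strict, and how pairs with $n_{t_\ell}(s,a) = 0$ are treated, are details that this sketch does not fix.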

{Our regret bound is similar to that of \cite{rosenberg2020near} 
%\footnote{The claimed bound in their paper is $\otil(\B^\frac{3}{2} S\sqrt{AK})$ if $\B$ is unknown, however, it can be improved by a factor of $\sqrt{\B}$ following the analysis of Theorem~\ref{thm2}.} 
and has a gap of $\sqrt{S}$ with the lower bound. Note that \citet{tarbouriech2021stochastic,cohen2021minimax,chen2021implicit} have proposed OFU algorithms that, in theory, close this gap for the minimax regret. However, as we will see in Section~\ref{sec: experiments}, the empirical performance of our PS algorithm is much better than that of the OFU algorithms proposed therein. Our algorithm is the \textit{first PS algorithm} for the SSP setting. As in the finite-horizon \citep{osband2013more} and the infinite-horizon average-cost \citep{ouyang2017learning} MDP settings, despite a $\sqrt{S}$ gap to the lower bound in theory, PS algorithms significantly outperform OFU-type algorithms empirically. The $\sqrt{S}$ gap is understood to be an artifact of the analysis, and it remains an open question how to close it via a tighter analysis of PS algorithms in general.}
%have closed the gap via OFU algorithms and reduction to the finite-horizon, respectively. However, the goal of this paper is not to match the minimax regret bound, but rather to introduce the first PS algorithm that has near-optimal regret bound with superior numerical performance than OFU algorithms. This is verified with the experiments in Section~\ref{sec: experiments}. The $\sqrt{S}$ gap exists for the PS algorithms in the finite-horizon \citep{osband2013more} and the infinite-horizon average-cost MDPs \citep{ouyang2017learning} as well. It remains an open question whether it is possible to achieve the lower bound via PS algorithms in these settings.

\noindent The \textbf{main contributions} of this paper are as follows:

\textit{Algorithmic novelty:} A strength of PS algorithms is that their design follows the same general template, and in the infinite-horizon setting, it essentially boils down to the design of the epochs since the rest of the algorithm is natural. Designing the epochs is non-trivial in the SSP setting for three reasons. First, although the SSP model seems closer to finite-horizon MDPs (as previous OFU algorithms suggest \citep{cohen2021minimax}), applying the PS algorithm for finite-horizon MDPs \citep{osband2013more}, which samples at the beginning of each episode, does not work for the SSP model because the policy obtained for the sampled transition kernel may not be proper (i.e., it may fail to reach the goal with probability one). Second, artificially switching to the fast policy after some time if the current policy does not reach the goal (as in \cite{tarbouriech2020no}) makes the algorithm unnecessarily complicated. Third, applying the PS algorithm for infinite-horizon average-cost MDPs \citep{ouyang2017learning} to the SSP setting leads to a sub-optimal regret bound of $\order(K^{2/3})$. We propose a simple yet effective epoch design that yields the near-optimal regret bound of $\otil(\B S\sqrt{AK})$. Our epochs are determined by two criteria. The first criterion encourages exploration by controlling the number of episodes in each epoch. The second criterion controls the number of visits to state-action pairs and is responsible for switching policies if the current policy is not proper.

\textit{Analytical novelty:} In finite-horizon MDPs, the regret of an episode is at most a constant proportional to the horizon. However, the variable length of the episodes in the SSP setting poses a significant challenge in the analysis because there is no upper bound on the regret of a single episode, let alone $K$ episodes. Therefore, the analysis of previous posterior sampling approaches cannot be applied directly. To handle this issue, we use the notion of an ``interval'' (only in the analysis) to artificially limit the total cost by definition. We then use concentration bounds, the posterior sampling property, and careful algebraic manipulation to self-bound the total cost $C_M$ after $M$ intervals in terms of $\sqrt{C_M}$. This allows us to show $C_M = \order(\sqrt{M})$ and then translate it into a regret bound. This type of analysis is inspired by \cite{rosenberg2020near} and is not common in previous PS algorithms for finite-horizon or infinite-horizon MDPs. Note that applying the optimism-based analysis of \cite{rosenberg2020near} to the PS setting poses new challenges, which we handle. More specifically, the optimistic transition kernel of \cite{rosenberg2020near} is in the confidence set with high probability. However, in the PS setting, the case where the sampled transition kernel falls outside the confidence set needs to be handled separately (see, e.g., how (9) is handled with the any-time Bernstein inequality).
Moreover, following Hoeffding-type concentration as in \citet{ouyang2017learning} yields a sub-optimal regret bound of $\order(K^{2/3})$. Instead, we propose a different analysis using Bernstein-type concentration, inspired by the work of \cite{rosenberg2020near}, to achieve an $\order(\sqrt{K})$ regret bound (see Lemma~\ref{lem: r3}). The new design of the epochs requires a novel analysis in Lemma~\ref{lem: r2} as well.
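
\noindent To illustrate what a self-bounding step of this kind looks like, consider the following elementary fact. It is a generic sketch only: the constants $a, b \geq 0$ are placeholders and are not the quantities that appear in the actual proofs. If $x \geq 0$ satisfies $x \leq a\sqrt{x} + b$, then solving the quadratic inequality in $\sqrt{x}$ gives
\[
\sqrt{x} \;\leq\; \frac{a + \sqrt{a^2 + 4b}}{2} \;\leq\; a + \sqrt{b},
\qquad \mbox{and hence} \qquad
x \;\leq\; \left(a + \sqrt{b}\right)^2 \;\leq\; 2a^2 + 2b .
\]
The second inequality uses $\sqrt{a^2 + 4b} \leq a + 2\sqrt{b}$, and the last one uses $2a\sqrt{b} \leq a^2 + b$. In this way, a bound on a quantity in terms of its own square root can be converted into an explicit bound.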

\textit{Numerical performance:} Our simulations on SSP-MountainCar and two synthetic environments verify that the \ssp~algorithm outperforms the optimism-based competitors significantly, with no hyper-parameter tuning.

%\subsection*{Related Work}
\textbf{Related Work. Posterior Sampling.} The idea of PS algorithms dates back to the pioneering work of \cite{thompson1933likelihood}. The algorithm was largely ignored for several decades. In the past two decades, PS algorithms have successfully been developed for various settings including multi-armed bandits \citep{scott2010modern,chapelle2011empirical,kaufmann2012thompson,agrawal2012analysis,agrawal2013thompson}, MDPs \citep{strens2000bayesian,osband2013more,fonteneau2013optimistic,gopalan2015thompson,osband2017posterior,kim2017thompson,ouyang2017learning,banjevic2019thompson}, Partially Observable MDPs \citep{jafarnia2021online}, Stochastic Games \citep{jafarnia2021learning}, and Linear Quadratic Control \citep{abeille2017thompson,ouyang2017learningbased}. The reader is referred to \cite{russo2017tutorial} for a more comprehensive literature review.

\textbf{Online Learning in SSP.} Another related line of work is online learning in the SSP model, which was introduced by \cite{tarbouriech2020no}. They proposed an algorithm with an $\otil(K^{2/3})$ regret bound. Subsequent work by \cite{rosenberg2020near} improved the regret bound to $\otil(\B S\sqrt{AK})$. \cite{cohen2021minimax,tarbouriech2021stochastic,chen2021implicit} proved a minimax regret bound of $\otil(\B\sqrt{SAK})$. However, none of these works proposes a PS-type algorithm. We refer the reader to \cite{yin2022offline} for offline learning of the SSP model, to \cite{rosenberg2020stochastic,chen2020minimax,chen2021finding} for the SSP model with adversarial costs, and to \cite{tarbouriech2021sample} for the sample complexity of the SSP model with a generative model.

%\textbf{Comparison with \cite{ouyang2017learning}.} Our work is  related to \cite{ouyang2017learning} which proposes \texttt{TSDE}, a PS algorithm for infinite-horizon average-cost MDPs. However, clear distinctions exist both in the algorithm and analysis. From the algorithmic perspective, our first criterion in determining the epoch length is different from \texttt{TSDE}. Note that using the same epochs as \texttt{TSDE} leads to a sub-optimal regret bound of $\order(K^{2/3})$ in the SSP model setting. Moreover, following Hoeffding-type concentration as in \texttt{TSDE}, yields a regret bound of $\order(K^{2/3})$ in the SSP model setting. Instead, we propose a different analysis using Bernstein-type concentration inspired by the work of \cite{rosenberg2020near} to achieve the $\order(\sqrt{K})$ regret bound (see Lemma~\ref{lem: r3}).