%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023}
% after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}

% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page),
% comment out the following usepackage line.
%\usepackage[colorlinks=true, linkcolor=blue, citecolor=blue, breaklinks=true]{hyperref}


% Attempt to make hyperref and algorithmic work together better:
\usepackage{algorithm,algorithmic}
%\newcommand{\theHalgorithm}{\arabic{algorithm}}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}

% if you use cleveref..
\usepackage[capitalize,noabbrev]{cleveref}

% Todonotes is useful during development; simply uncomment the next line
%    and comment out the line below the next line to turn off comments
%\usepackage[disable,textsize=tiny]{todonotes}
%\usepackage[textsize=tiny]{todonotes}

\input{header}

\title{Posterior Sampling-Based Online Learning for \\ the Stochastic Shortest Path Model}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%\author[1]{\href{mailto:<rahul.jain@usc.edu>?Subject=Your UAI 2023 paper}{Rahul Jain}{}}
\author[1]{Mehdi Jafarnia-Jahromi}
\author[3]{Liyu Chen}
\author[2,3,4]{Rahul Jain}
\author[3]{Haipeng Luo}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Google DeepMind
}
\affil[2]{%
    ECE Department, 
    University of Southern California
}
\affil[3]{%
    CS Department, University of Southern California
  }
\affil[4]{%
    USC Center for Autonomy and AI
  }
%===================================  
\begin{document}
\maketitle


\begin{abstract}
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose \ssp, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch ends when either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any state-action pair doubles. We establish a Bayesian regret bound of $\otil(\B S\sqrt{AK})$, where $\B$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm requires knowledge of only the prior distribution and has no hyperparameters to tune. It is the first such posterior sampling algorithm for the SSP problem, and it numerically outperforms previously proposed optimism-based algorithms.
\end{abstract}


\input{intro}
\input{preliminaries}
\input{algorithm}
\input{analysis}
\input{experiments}


\section*{Conclusions}

In this paper, we have proposed the first posterior sampling-based reinforcement learning algorithm for SSP models with unknown transition probabilities. The algorithm is very simple compared to the optimism-based algorithms recently proposed for SSP models \citep{tarbouriech2020no,rosenberg2020near,cohen2021minimax,tarbouriech2021stochastic}. It achieves a Bayesian regret bound of $\otil(\B S\sqrt{AK})$, where $\B$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. This leaves a $\sqrt{S}$ gap from the best known bound for an optimism-based algorithm, but numerical experiments suggest better performance in practice. A next step would be to extend the algorithm to continuous state and action spaces, and to propose model-free algorithms for such settings. Designing posterior sampling-based model-free algorithms even for average MDPs remains an open problem. Another interesting future direction is to extend ideas from \citet{tiapkin2022optimistic} to obtain a frequentist regret bound for posterior sampling-based algorithms in the SSP setting.

\paragraph{Acknowledgements}
HL is supported by NSF Award IIS-1943607 and a Google Research Scholar Award. RJ is supported by NSF ECCS-2025732 and ONR N00014-20-1-2258 awards. 

%\newpage
%\bibliographystyle{icml2022}
\bibliography{jafarnia_255-authorship.bib}


%\newpage
%\appendix
%\onecolumn
%
%
%\input{appendix}


\end{document}


% This document was modified from the file originally made available by
% Pat Langley and Andrea Danyluk for ICML-2K. This version was created
% by Iain Murray in 2018, and modified by Alexandre Bouchard in
% 2019 and 2021 and by Csaba Szepesvari, Gang Niu and Sivan Sabato in 2022. 
% Previous contributors include Dan Roy, Lise Getoor and Tobias
% Scheffer, which was slightly modified from the 2010 version by
% Thorsten Joachims & Johannes Fuernkranz, slightly modified from the
% 2009 version by Kiri Wagstaff and Sam Roweis's 2008 version, which is
% slightly modified from Prasad Tadepalli's 2007 version which is a
% lightly changed version of the previous year's version by Andrew
% Moore, which was in turn edited from those of Kristian Kersting and
% Codrina Lauth. Alex Smola contributed to the algorithmic style files.
