% !TEX root = main21neurips-ssp.tex

\section{Preliminaries}
\label{sec: preliminaries}

A Stochastic Shortest Path (SSP) model is denoted by $\calM = (\calS, \calA, c, \theta, \sinit, g)$, where $\calS$ is the state space, $\calA$ is the action space, $c: \calS \times \calA \to [0, 1]$ is the cost function, $\sinit \in \calS$ is the initial state, $g \notin \calS$ is the goal state, and $\theta : \calS^+ \times \calS \times \calA \to [0, 1]$ is the transition kernel, with $\theta(s' | s, a) = \mathbb{P}(s_t'=s'|s_t=s, a_t=a)$ and $\calS^+ = \calS \cup \{g\}$ denoting the state space augmented with the goal state. Here, $s_t \in \calS$ and $a_t \in \calA$ are the state and action at time $t=1, 2, 3, \ldots$, and $s_t' \in \calS^+$ is the subsequent state. We assume that the initial state $\sinit$ is fixed and known, and that $\calS$ and $\calA$ are finite sets of size $S$ and $A$, respectively.

A stationary policy is a deterministic map $\pi: \calS \to \calA$ from states to actions. The \textit{value function} (also called the \textit{cost-to-go function}) associated with policy $\pi$ is a function $V^\pi(\cdot;\theta): \calS^+ \to [0, \infty]$ given by $V^{\pi}(g;\theta) = 0$ and $V^{\pi}(s;\theta) := \E[\sum_{t=1}^{\tau_{\pi}(s)}c(s_t, \pi(s_t)) | s_1=s]$ for $s \in \calS$, where $\tau_{\pi}(s)$ is the (random) number of steps before reaching the goal state when the initial state is $s$ and policy $\pi$ is followed throughout the episode. The notation $V^\pi(\cdot;\theta)$ makes the dependence of the value function on $\theta$ explicit. The optimal value function is defined as $V(s;\theta) = \min_{\pi} V^\pi(s;\theta)$. A policy $\pi$ is called \textit{proper} if the goal state is reached with probability $1$ when starting from any initial state and following $\pi$ (i.e., $\max_s \tau_\pi (s) < \infty$ almost surely); otherwise, it is called \textit{improper}.
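
As a simple illustration (a toy instance introduced here only for exposition, not one analyzed in this paper), consider an SSP with a single non-goal state, $\calS = \{s\}$, and a single action $a$ with cost $c(s, a) = c_0 \in (0, 1]$, $\theta(g | s, a) = p$, and $\theta(s | s, a) = 1 - p$. There is only one policy $\pi$. If $p > 0$, then $\tau_\pi(s)$ is geometrically distributed with mean $1/p$, so
\begin{align*}
V^{\pi}(s;\theta) = c_0 \, \E[\tau_\pi(s)] = \frac{c_0}{p} < \infty,
\end{align*}
and $\pi$ is proper. If $p = 0$, the goal is never reached, so $\pi$ is improper and $V^{\pi}(s;\theta) = \infty$.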

We consider the reinforcement learning problem of an agent interacting with an SSP model $\calM = (\calS, \calA, c, \theta_*, \sinit, g)$ whose transition kernel $\theta_*$ is randomly generated according to the prior distribution $\mu_1$ at the beginning and is then fixed. We will focus on SSP models with transition kernels in the set $\Theta_{\B}$ with the following standard properties:
\begin{assumption}
\label{ass: class of ssp}
For all $\theta \in \Theta_{\B}$, the following hold: (1) there exists a proper policy,
(2) for every improper policy $\pi_\theta$, there exists a state $s \in \calS$ such that $V^{\pi_\theta}(s;\theta) = \infty$, and (3) the optimal value function satisfies $\max_s V(s;\theta) \leq \B$.
\end{assumption}
\citet{bertsekas1991analysis} prove that the first two conditions in Assumption~\ref{ass: class of ssp} imply that, for each $\theta \in \Theta_{\B}$, the optimal policy is stationary, deterministic, and proper, and that it can be obtained as the minimizer in the \textit{Bellman optimality equations}
\begin{align}
\label{eq: Bellman equation}
V(s;\theta) = \min_a \Big\{c(s, a) + \sum_{s'\in \calS^+}\theta(s'|s, a)V(s';\theta)\Big\}, \quad \forall s \in \calS.
\end{align}
Standard techniques such as Value Iteration and Policy Iteration can be used to compute the optimal policy if the SSP model is known \citep{bertsekas2017dynamic}. Here, we assume that $\calS$, $\calA$, and the cost function $c$ are known (though the algorithm can be extended easily to an unknown cost function), whereas the transition kernel $\theta_*$ is unknown.
%\textcolor{red}{The algorithm we propose can be extended to the case where $c$ is unknown as well by use of prior developed methods \cite{osband2013more}.} 
Moreover, we assume that the support of the prior distribution $\mu_1$ is a subset of $\Theta_{\B}$.
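
For concreteness, the Value Iteration scheme mentioned above can be sketched as follows for a known kernel $\theta \in \Theta_{\B}$; the iterate notation $V_n$ is introduced here only for illustration. Starting from any $V_0: \calS^+ \to [0, \infty)$ with $V_0(g) = 0$, one repeats
\begin{align*}
V_{n+1}(s) = \min_a \Big\{c(s, a) + \sum_{s'\in \calS^+}\theta(s'|s, a)V_n(s')\Big\}, \quad \forall s \in \calS, \qquad V_{n+1}(g) = 0.
\end{align*}
Under the first two conditions of Assumption~\ref{ass: class of ssp}, these iterates are known to converge to $V(\cdot;\theta)$ \citep{bertsekas1991analysis}, and a policy that attains the minimum in \eqref{eq: Bellman equation} at every state is optimal.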

The agent interacts with the environment in $K$ episodes. Each episode starts from the initial state $\sinit$ and ends when the goal state $g$ is reached (note that the agent may never reach the goal). At time $t$, the agent observes state $s_t$ and takes action $a_t$. The environment then yields the next state $s_t' \sim \theta_*(\cdot | s_t, a_t)$. If the goal is reached (i.e., $s_t' = g$), then the current episode completes, a new episode starts, and $s_{t+1} = \sinit$. If the goal is not reached (i.e., $s_t' \neq g$), then $s_{t+1}=s_t'$. The objective of the agent is to minimize the expected cumulative cost over $K$ episodes, or equivalently, to minimize the \textit{Bayesian regret}:
\begin{align*}
R_K &:= \E\sbr{\sum_{t=1}^{T_K} c(s_t, a_t) - KV(\sinit;\theta_*)},
\end{align*} 
where $T_K$ is the total number of time steps before reaching the goal state for the $K$th time, and $V(\sinit;\theta_*)$ is the optimal value at the initial state, given by \eqref{eq: Bellman equation}. Here, the expectation is taken with respect to the prior distribution $\mu_1$ for $\theta_*$, the horizon $T_K$, the randomness in the state transitions, and the randomness of the algorithm. If the agent fails to reach the goal state in any of the episodes (i.e., $T_K = \infty$), we define $R_K = \infty$.
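
To make this definition concrete, the regret can be rewritten episode by episode. Let $T_k$ denote the total number of time steps before reaching the goal state for the $k$th time, with $T_0 := 0$ (this indexing extends the notation $T_K$ and is used here only for illustration). When all $T_k$ are finite, episode $k$ consists of the time steps $T_{k-1}+1, \ldots, T_k$, and
\begin{align*}
R_K = \E\sbr{\sum_{k=1}^{K} \Big(\sum_{t=T_{k-1}+1}^{T_k} c(s_t, a_t) - V(\sinit;\theta_*)\Big)}.
\end{align*}
Each summand compares the cost incurred in one episode with the optimal expected cost of an episode. In particular, an agent that knew $\theta_*$ and followed the optimal policy would incur expected cost $V(\sinit;\theta_*)$ in every episode and hence zero regret.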

