
\textbf{Problem Definition \quad}
We consider the undiscounted, indefinite-horizon Maximal Reachability Probability Problem (MRPP). The agent is given a POMDP $\mathcal{M}$ and a set of target states $\mathrm{T} \subseteq S$.
We define the \textit{reachability probability} $P^{\pi}_{\mathcal{M}}(\lozenge \mathrm{T})$ to be the probability of reaching $\mathrm{T}$ under policy $\pi$ from an initial belief $b_0$ in the POMDP $\mathcal{M}$. An \textit{optimal policy} is one that maximizes the reachability probability:
$\pi^* = \underset{\pi}{\argsup} P^\pi_{\mathcal{M}}(\lozenge \mathrm{T}).$
\begin{problem}[$\epsilon$-MRPP]
    \label{problem}
    Given a POMDP $\mathcal{M}$, a set of target states $\mathrm{T} \subseteq S$, and a regret bound $\epsilon \in (0, 1]$, find a policy $\hat{\pi}$ that is $\epsilon$-optimal, i.e.,
    \begin{align}
        \label{eq: problem}
         P^{\pi^*}_{\mathcal{M}}(\lozenge \mathrm{T}) - P^{\hat{\pi}}_{\mathcal{M}}(\lozenge \mathrm{T}) \leq \epsilon.
    \end{align}
\end{problem}

Reachability probabilities can be computed by augmenting the POMDP with an absorbing state $s_\mathrm{T}$ and an action $a_\mathrm{T}$, giving the state space $S \cup \{s_\mathrm{T}\}$ and the action space $A \cup \{a_\mathrm{T}\}$, such that the transition probabilities are $T(s,a,s') = 1$ if $s \in \mathrm{T} \cup \{s_\mathrm{T}\}$, $a = a_\mathrm{T}$, and $s' = s_\mathrm{T}$; otherwise $T(s,a,s') = 0$. With a reward function that assigns a reward of $1$ to the augmented transitions into $s_\mathrm{T}$ and $0$ otherwise (i.e., $R_\mathrm{rp}(s,a) = 1$ if $s \in \mathrm{T}$ and $a = a_\mathrm{T}$, otherwise $R_\mathrm{rp}(s,a) = 0$), the undiscounted ($\gamma = 1$) expected cumulative reward of a policy equals its reachability probability \citep{de1998formal}, i.e.,
\begin{equation}
    \label{eq:probability expected total reward}
    P^{\pi}_{\mathcal{M}}(\lozenge \mathrm{T}) = V^\pi(b_0) = \mathbb{E}\Big[ \sum_{t=0}^\infty \gamma^t R_\mathrm{rp}(b_t,\pi(b_t)) \mid b_0, \pi \Big].
\end{equation}
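As a sketch of how the state-based reward enters Eq.~(\ref{eq:probability expected total reward}), assume the standard lifting of rewards to beliefs, $R_\mathrm{rp}(b,a) = \sum_{s} b(s)\, R_\mathrm{rp}(s,a)$ (an assumption made here for illustration, since this lifting is not spelled out above). Under that assumption, the only nonzero belief reward is
\[
    R_\mathrm{rp}(b, a_\mathrm{T}) = \sum_{s \in \mathrm{T}} b(s),
\]
i.e., the belief mass currently placed on the target set. For a hypothetical belief $b = (0.2, 0.5, 0.3)$ over three states $(s_1, s_2, s_3)$ with $\mathrm{T} = \{s_2, s_3\}$, this gives $R_\mathrm{rp}(b, a_\mathrm{T}) = 0.5 + 0.3 = 0.8$.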

\begin{remark}
    In many cases, one wants to know whether there exists a policy whose reachability probability exceeds a given threshold. Solutions to Problem~\ref{problem} also allow one to answer such questions.
\end{remark}
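
The threshold question can be sketched as follows, assuming that the value $\hat{v} = P^{\hat{\pi}}_{\mathcal{M}}(\lozenge \mathrm{T})$ of the returned policy is available and writing $\lambda$ for the threshold (both symbols are introduced here for illustration only). Since no policy outperforms $\pi^*$ and $\hat{\pi}$ is $\epsilon$-optimal,
\[
    \hat{v} \;\leq\; P^{\pi^*}_{\mathcal{M}}(\lozenge \mathrm{T}) \;\leq\; \hat{v} + \epsilon .
\]
Hence a policy exceeding $\lambda$ exists if $\hat{v} > \lambda$, and no such policy exists if $\hat{v} + \epsilon \leq \lambda$. In the remaining case, $\lambda - \epsilon < \hat{v} \leq \lambda$, the answer is inconclusive for this $\epsilon$, and the problem would need to be solved again with a smaller regret bound.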

\begin{remark}
    An algorithm for MRPP can also be used for problems in which the agent is tasked with maximizing the probability of satisfying temporal logic specifications, such as syntactically co-safe Linear Temporal Logic (cs-LTL)~\citep{kupferman2001modelchecking} or LTL over finite traces (LTLf)~\citep{ltlf}. These objectives can be converted into MRPP by planning in the product space of the POMDP and a Deterministic Finite Automaton (DFA) representation of the temporal logic formula.
\end{remark}
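
One possible form of the product construction is sketched below. The labeling function $L : S \to 2^{AP}$ and the DFA $\mathcal{A} = (Q, 2^{AP}, \delta, q_0, F)$ are notation assumed here for illustration, and reading the label from the successor state is one common convention among several:
\begin{align*}
    S^\otimes &= S \times Q, \\
    T^\otimes\big((s,q), a, (s',q')\big) &=
    \begin{cases}
        T(s,a,s') & \text{if } q' = \delta(q, L(s')), \\
        0 & \text{otherwise},
    \end{cases} \\
    \mathrm{T}^\otimes &= S \times F.
\end{align*}
Under this convention, a run starting in state $s_0$ begins in the product state $(s_0, \delta(q_0, L(s_0)))$, and reaching $\mathrm{T}^\otimes$ corresponds to the label sequence observed so far being accepted by $\mathcal{A}$.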