\section{Introduction}

Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making in which the agent faces both transition uncertainty and partial observability \citep{Sondik1978pomdp, Kaelbling1998pomdp}. The goal in a POMDP planning problem is to compute a policy that optimizes some objective, typically expressed as a reward function or a logical specification, e.g., linear temporal logic (LTL) or probabilistic computation tree logic \citep{baier2008principles}. POMDP problems are notoriously hard (due to the curses of dimensionality and history), and finding an optimal policy for them is undecidable \citep{MADANI2003Undecidability}. Hence, to enable tractability, simple objectives are usually considered by either fixing a finite-time horizon or imposing discounting. For such objectives, effective techniques have been developed that can provide approximate solutions quickly~\citep{shani2013survey}. However, a crucial problem in planning under uncertainty and probabilistic model checking is to find a policy that maximizes the probability of reaching a set of target states without knowing \emph{a priori} how many steps it may take. This is known as the (indefinite-horizon) Maximal Reachability Probability Problem (MRPP)~\citep{de1998formal}. This paper focuses on MRPP for POMDPs and aims to develop an efficient algorithm with optimality guarantees for MRPP.

For infinite-horizon discounted problems, point-based methods \citep{Pineau2003pbvi, Smith2005HSVI2, Kurniawati-RSS08-SARSOP} approximate the value function by incrementally exploring the space of reachable beliefs. Trial-based belief exploration algorithms such as Heuristic Search Value Iteration (HSVI2) \citep{Smith2005HSVI2} and its extensions have been shown to be the most effective. These methods use two-sided bounds to heuristically search for a near-optimal policy via tree search, and can efficiently solve moderately large POMDPs in both finite and discounted infinite-horizon settings to arbitrary precision. However, these techniques have not been studied for the (undiscounted indefinite-horizon) MRPP. Recently, approaches have been proposed that provide under-approximations of reachability probabilities for MRPP \citep{Bork2022underapproximating, Andriushchenko2022InductivePOMDP, andriushchenko2023symbiotic}. Although these under-approximations are shown to be tight empirically, there is no way to ascertain how close they are to the optimal value or whether the computed policy has converged to optimality.

Inspired by the success of trial-based belief exploration for discounted POMDPs, and with the goal of designing an algorithm that efficiently attains tight two-sided bounds, this paper studies the effectiveness of trial-based belief exploration for POMDPs with MRPP objectives. To this end, we analyze the drawbacks of discounted POMDP trial-based search when applied to MRPP, and propose an algorithm that leverages the strengths of trial-based belief exploration while addressing these drawbacks.

Our proposed algorithm is a trial-based belief exploration approach that maintains and improves two-sided bounds on the maximal probability of satisfying a reachability objective. Forward exploration guided by these bounds focuses the search on the relevant areas of the belief space, which improves search efficiency. Improving the bounds at one belief can also improve the bounds at other beliefs. We propose new heuristics for trial-based search tailored to MRPP, and discuss techniques to ensure that both bounds remain improvable during search. We prove the asymptotic convergence of the lower bound to the optimal value under some conditions. Experimental evaluations show that our approach computes tight lower and upper bounds simultaneously, that these bounds improve over time, and that they converge to the optimal solution for several moderately sized POMDPs. Results show that trial-based exploration allows for efficient search, outperforming state-of-the-art belief-based approaches. Further, our approach is highly competitive, obtaining two-sided bounds that can be tighter than those of existing solutions, which generally compute either a lower \emph{or} an upper bound, not both.

In short, the contributions of the paper are: (i) an analysis of theoretical and practical issues that arise when applying discounted-sum algorithms to MRPP, (ii) an efficient algorithm that simultaneously computes sound upper and lower bounds on maximal reachability probabilities of POMDPs in an anytime manner, (iii) a proof of asymptotic convergence of the lower bound to the optimal reachability probability value, and (iv) a suite of benchmark comparisons showing that our algorithm outperforms existing methods in almost every case, in both tightness of the bounds and computation time.