\section{Preliminaries and Problem Formulation}
    
\textbf{Partially Observable Markov Decision Processes \quad} We focus on POMDPs with the following definition.

\begin{definition}[POMDP]
    \label{def:pomdp}
    A \emph{Partially Observable Markov Decision Process} (POMDP) is a tuple  $\mathcal{M} = (S, A, O, T, Z, b_0)$, where:
    $S, A,$ and $O$ are finite sets of states, actions, and observations, respectively, 
    $T : S \times A \times S \rightarrow [0,1]$ is the transition probability function,
    $Z : S \times A \times O \rightarrow [0,1]$ is the probabilistic observation function,
    and $b_0 \in \Delta(S)$ is an initial belief, where $\Delta(S)$ is the probability simplex (the set of all probability distributions) over $S$.
\end{definition}
\noindent
We denote the probability distribution over states in $S$ at time $t$ by $b_t$ and the probability of being in state $s \in S$ at time $t$ by $b_t(s)$. 

The evolution of an agent according to a POMDP model is as follows. At each time step $t \in \mathbb{N}_0$, the agent holds a belief $b_t$ about its state $s_t$, i.e., a probability distribution over $S$, and takes an action $a_t \in A$. Its state evolves from $s_t \in S$ to $s_{t+1} \in S$ with probability $T(s_t,a_t,s_{t+1})$, and it receives an observation $o_{t} \in O$ with probability
$Z(s_{t+1}, a_t, o_{t})$. The agent then updates its belief recursively: for each $s' \in S$,
\begin{align}
    \label{eq:beliefupdate}
    b_{t+1}(s') \propto Z(s', a_t, o_{t}) \sum_{s \in S} T(s,a_t,s') b_{t}(s).
\end{align}
Then, the process repeats. 
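
For concreteness, the proportionality in~\eqref{eq:beliefupdate} can be made explicit by dividing by the probability of the received observation. The following is a sketch in the notation above; the two-state numbers are hypothetical and chosen only for illustration.
\begin{align*}
    b_{t+1}(s') = \frac{Z(s', a_t, o_t) \sum_{s \in S} T(s, a_t, s')\, b_t(s)}{\sum_{s'' \in S} Z(s'', a_t, o_t) \sum_{s \in S} T(s, a_t, s'')\, b_t(s)}.
\end{align*}
As an example, suppose $S = \{s_1, s_2\}$, $b_t = (0.5, 0.5)$, $T(s_1,a,s_1) = 0.9$, $T(s_1,a,s_2) = 0.1$, $T(s_2,a,s_1) = 0.2$, $T(s_2,a,s_2) = 0.8$, $Z(s_1,a,o) = 0.7$, and $Z(s_2,a,o) = 0.1$. The predicted state distribution is $(0.55, 0.45)$, the unnormalized update is $(0.385, 0.045)$, and dividing by the normalizer $0.43$ gives $b_{t+1} \approx (0.895, 0.105)$.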

The agent chooses actions according to a \textit{policy} $\pi: \Delta(S) \to A$, which maps a belief $b$ to an action. Typically, the agent is given a reward function $R : S \times A \rightarrow \mathbb{R}$, where $R(s_t, a_t)$ is the immediate reward of taking action $a_t$ in state $s_t$. A POMDP can be reduced to an MDP with an infinite number of states, whose states are the beliefs $B = \{b \in \Delta(S)\}$. This MDP is called a \emph{belief MDP} \citep{Astrom1965beliefmdp}. Let $R(b,a) = \mathbb{E}_{s \sim b}\left[R(s,a)\right] = \sum_{s \in S} b(s) R(s,a)$ denote the expected immediate reward at belief $b$. Given a discount factor $\gamma \in [0, 1]$, the \textit{expected discounted sum of rewards} that the agent receives under policy $\pi$ starting from belief $b_t$ is
\begin{align}
    \label{eq: total reward}
    V^{\pi}(b_t) = \mathbb{E} \Big[ \sum_{j=t}^{\infty} \gamma^{j - t} R\left(b_{j}, \pi(b_{j})\right) \mid b_t, \pi \Big].
\end{align}

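The objective of discounted POMDP planning is typically to find a policy $\pi$ whose value $V^{\pi}(b_0)$ is within some threshold $\epsilon$ of the best achievable value, i.e., $\sup_{\pi'} V^{\pi'}(b_0) - V^{\pi}(b_0) \leq \epsilon$.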
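
As a sketch in the notation above, for $\gamma < 1$ the optimal value function of the belief MDP, written $V^*$, satisfies the standard Bellman optimality equation. Here $b^{a,o}$ is shorthand introduced only for this illustration: it denotes the belief obtained from $b$ via~\eqref{eq:beliefupdate} after taking action $a$ and observing $o$, and $P(o \mid b, a)$ is the probability of that observation.
\begin{align*}
    V^*(b) &= \max_{a \in A} \Big[ R(b,a) + \gamma \sum_{o \in O} P(o \mid b, a)\, V^*(b^{a,o}) \Big], \\
    P(o \mid b, a) &= \sum_{s' \in S} Z(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s).
\end{align*}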

The optimal value function $V^*$ for a POMDP can be under-approximated arbitrarily well by a piecewise linear and convex function \citep{Sondik1978pomdp}, $V^*(b) \geq \max_{\alpha \in \Gamma^*}(\alpha^T b)$, where $\Gamma^*$ is a finite set of $|S|$-dimensional hyperplanes, called $\alpha$-vectors, representing the optimal value function.
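
To illustrate this representation, consider a hypothetical two-state example; the vectors below are chosen only for illustration and do not come from any particular model. With $\Gamma = \{(1, 0), (0, 1), (0.6, 0.6)\}$, the function $\max_{\alpha \in \Gamma}(\alpha^T b)$ evaluates as
\begin{align*}
    b = (0.5, 0.5): &\quad \max\{0.5,\, 0.5,\, 0.6\} = 0.6, \\
    b = (0.9, 0.1): &\quad \max\{0.9,\, 0.1,\, 0.6\} = 0.9.
\end{align*}
Each $\alpha$-vector is linear in $b$, so the pointwise maximum is piecewise linear and convex, and adding vectors to the set can only raise its value at any belief.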

\textbf{Trial-based Value Iteration for Discounted POMDPs \quad}

There is a mature literature on algorithms for discounted-sum POMDP problems with discount factor $\gamma < 1$. Among discounted-sum POMDP algorithms that provide finite-time convergence guarantees, trial-based heuristic tree search algorithms \citep{Smith2005HSVI2, Kurniawati-RSS08-SARSOP, zhang2015please} generally exhibit the best performance. These algorithms typically maintain and refine upper and lower bounds on the optimal value function, and they explore the reachable belief space through repeated trials over a constructed belief tree.

The basic ingredients of these algorithms are policy representation, action selection, observation/belief selection, backup function, and trial termination criteria. We briefly describe HSVI2~\citep{Smith2005HSVI2}, a trial-based heuristic search algorithm with these basic ingredients.

HSVI2 maintains an upper bound $V^U$ and a lower bound $V^L$ on the optimal value function. The lower bound is represented as a set of $\alpha$-vectors, and the upper bound is computed from an upper bound point set. Trials are conducted in a depth-first manner. At each belief $b_t$, the action with the highest upper bound on its $Q$-value is chosen for expansion using the IE-MAX heuristic:
\begin{align}
    \label{eq:iemax}
    a^* = \arg\max_{a \in A}\big\{R(b_t,a) + \gamma\, \mathbb{E}\big[V^U(b_{t+1}) \mid b_t, a\big]\big\}.
\end{align}
Then, the observation $o^*$ whose successor belief $b_{t+1}$ has the highest weighted excess uncertainty (WEU) is selected:
\begin{align}
    & \textsc{WEU}(b_t, t, \epsilon) = \big[V^U(b_t) - V^L(b_t) - \epsilon \gamma^{-t}\big],\label{eq: weu} \\
    &o^* = \arg\max_{o \in O}[P(o \mid b_t, a^*) \cdot \textsc{WEU}(b_{t+1}, t+1, \epsilon)]\label{eq:HSVI2heuristic}.
\end{align}
HSVI2 terminates a trial when $V^U(b_t) - V^L(b_t) \leq \epsilon \gamma^{-t}$. After each trial, a Bellman backup is performed at the beliefs sampled during the trial, improving the lower and upper bound sets. For a discounted POMDP ($0 \leq \gamma < 1$), HSVI2 provably converges to an $\epsilon$-optimal approximation of $V^*(b_0)$ with sound two-sided bounds.
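
A short calculation, given here as a sketch, shows why the termination criterion bounds the depth of every trial when $0 < \gamma < 1$. Assume the gap $V^U(b) - V^L(b)$ is at most $W$ at every belief, where $W > \epsilon$ is a symbol introduced only for this illustration. The termination condition is then guaranteed to hold once $W \leq \epsilon \gamma^{-t}$, that is, once
\begin{align*}
    t \geq \frac{\log(\epsilon / W)}{\log \gamma}.
\end{align*}
For the hypothetical values $\gamma = 0.95$, $\epsilon = 0.01$, and $W = 10$, this gives $t \geq \log(0.001)/\log(0.95) \approx 134.7$, so no trial extends beyond depth $135$. The argument relies on $\gamma < 1$: the threshold $\epsilon \gamma^{-t}$ grows with depth only in the discounted case.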