
%\documentclass{uai2024}

\documentclass[accepted]{uai2024} % after acceptance, for a revised version;                  
% also before submission to see how the non-anonymous paper would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
% Modern (has noticeable issues)                                       % \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon 
                                          % ptmx; less tested, no support)                  
% NOTE: Only keep *one* line above as appropriate, as it will be replaced                      
%       automatically for papers to be published. Do not make any other                        
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}                                   % \usepackage[british]{babel} 

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}

%% Language setting
%% Replace `english' with e.g. `spanish' to change the document language
%\usepackage[english]{babel}
%%\usepackage{ijcai24}

% Set page size and margins
% Replace `letterpaper' with `a5paper' for UK/EU standard size
%\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}

% Useful packages
\usepackage{amsmath}
\usepackage{graphicx}
%\usepackage[colorlinks=true, allcolors=blue]{hyperref}
\usepackage{amsthm}
\usepackage{amsfonts}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{thm-restate}

\theoremstyle{plain}
\newtheorem*{example}{Example}
\newtheorem{theorem}{Theorem}
\theoremstyle{definition}
\newtheorem*{definition}{Definition}
\theoremstyle{plain}
\newtheorem{lemma}{Lemma}
\newtheorem{proposition}{Proposition}
\newtheorem{probably}{Probably True}

\newtheorem{corollary}{Corollary}
%\theoremstyle{open}
%\newtheorem{open}{Open Problem}
%\theoremstyle{idea}
%\newtheorem*{idea}{Idea}
%\theoremstyle{question}
%\newtheorem{question}{Question}
%\newenvironment{Proof}{%
%  \renewcommand{\proofname}{Proof}\proof}{\endproof}
% 
%% if needed . . .
%% \renewcommand{\maketitlehooka}{\vbox to 1.75in\bgroup}% was 2.375in

\newenvironment{sketch}{%
  \renewcommand{\proofname}{Proof Sketch}\proof}{\endproof}

\newcommand{\mypara}[1]{\vspace{0pt}\noindent\textbf{#1}~~~}
\newcommand{\ignore}[1]{}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\title{Approximation Algorithms for Observer Aware MDPs}

%\author[1]{\href{mailto:Harry Q. Bovik <harryq@example.edu>?Subject=Your UAI 2021 paper}{Harry~Q.~Bovik}{}} % Lead author
%\author[2]{Coauthor~One}
%\author[1,2]{Coauthor~Two}
%\author[3]{Further~Coauthor}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
%% Add affiliations after the authors
%\affil[1]{%
%    Computer Science Dept.\\
%    Cranberry University\\
%    Pittsburgh, Pennsylvania, USA
%}
%\affil[2]{%
%    Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{…}
\author[1]{Shuwa Miura}
\author[2]{Olivier Buffet}
\author[1]{Shlomo Zilberstein}

\affil[1]{%
    University of Massachusetts, Amherst, MA, USA
}
\affil[2]{%
Universit\'e de Lorraine, INRIA, CNRS, LORIA, Nancy, France
}


%\author{ {Shuwa Miura  ~~~~~~~~~ Shlomo Zilberstein}\\
%College of Information and Computer Sciences \\  
%University of Massachusetts Amherst\\ 
%\texttt{\{smiura,shlomo\}@umass.edu}
%}

\begin{document}
\maketitle

\begin{abstract}
We present approximation algorithms for Observer-Aware Markov Decision Processes (OAMDPs). OAMDPs model sequential decision-making problems in which rewards depend on the beliefs of an observer about the goals, intentions, or capabilities of the observed agent. The first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the observer's belief into regular grids. Based on the same discretization, the second proposed algorithm is a variant of Real-Time Dynamic Programming (RTDP) called Grid-RTDP. Unlike Grid-VI, Grid-RTDP focuses its updates on promising states using heuristic estimates. 
We provide theoretical guarantees for the proposed algorithms and demonstrate that Grid-RTDP has good anytime performance, comparable to that of the existing approach, which lacks performance guarantees.
\end{abstract}

\section{Introduction}

Effective communication of intentions, goals, and desires is crucial in our daily interactions and is equally vital for autonomous agents. For instance, consider an autonomous vehicle (AV) approaching a crosswalk with a pedestrian nearby. The AV might approach the crosswalk at high speed and then decelerate just in time to avoid hitting the pedestrian. However, this can be unsettling for the pedestrian. A more reassuring approach would be for the AV to slow down well before reaching the crosswalk, signaling its intention to stop. We term such actions that take into account the perspective or beliefs of an observing agent as \emph{observer-aware} behaviors. 
Observer-aware behaviors include making the agent's goal clear \citep{draganGeneratingLegibleMotion2013}, demonstrating its capabilities \citep{kwonExpressingRobotIncapability2018} or disguising possible intentions \citep{mastersDeceptivePathPlanning2017,savasDeceptiveDecisionmakingUncertainty2022}.

The Observer-Aware Markov Decision Process (OAMDP) \citep{miuraUnifyingFrameworkObserveraware2021} offers a general framework for producing observer-aware behaviors.
The OAMDP framework assumes a model of how the agent's actions would be interpreted by the observer. 
In OAMDPs, 
possible goals, intentions, or capabilities of the observed agent are represented as types.
After the observed agent takes an action, the observing agent updates its belief over the possible types, which determines the reward function.
%In OAMDPs, rewards can depend on the observer's beliefs about the agent's types.

While the OAMDP framework allows modeling various observer-aware planning problems in a unified way, solving OAMDPs has been shown to be intractable in the worst case \citep{miuraUnifyingFrameworkObserveraware2021}.
The intractability stems from the fact that rewards depend on the belief of the observer, which in turn depends on the history so far.
Previous work proposed using Monte-Carlo Tree Search (MCTS) to solve OAMDPs for the finite-horizon objective~\citep{miuraUnifyingFrameworkObserveraware2021}.
While MCTS exhibits good anytime behavior,
it does not provide guarantees on the quality of the resulting policies.

In this paper, we propose the first approximation algorithms for OAMDPs.
We begin by establishing that the domain state and the observer's belief are sufficient statistics in OAMDPs (Proposition~\ref{prop::sufficient_statistics}). 
%Then we discuss assumptions needed for applying approximate algorithms.
%Since the value function in OAMDPs can be discontinuous over the belief of the observer, 
%we identify a subclass of OAMDPs where the value function remains continuous over the observer's belief.
Our first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the belief of the observer into regular grids.
We show that Grid-VI converges to the unique fixpoint both in discounted (Proposition~\ref{prop::grid_vi_convergence_mdp}) and undiscounted (Proposition~\ref{prop::grid_vi_convergence_ssp}) settings under the standard assumptions, and provide error bounds for the discounted setting (Proposition~\ref{prop::grid_vi_error_bound}).
%At each iteration of the algorithm, the values at each grid points are updated, where
%values at other beliefs are linearly interpolated using grid points.
A potential drawback of Grid-VI is that it can waste time updating values at irrelevant states. 
To address the issue, we propose a variant of Real-Time Dynamic Programming (RTDP)~\citep{bartoLearningActUsing1995a} to solve OAMDPs, called Grid-RTDP.
Grid-RTDP utilizes heuristic estimates to focus updates on promising states.
We show that Grid-RTDP retains a key desirable property of RTDP (Proposition~\ref{prop::rtdp_convergence}). Our experimental results indicate that our proposed algorithms are capable of computing near-optimal policies. Specifically, Grid-RTDP solves problems significantly faster than Grid-VI and offers anytime performance comparable to MCTS.

\section{Background and Notation}
\subsection{Markov Decision Processes}
A finite Markov decision process (MDP) models sequential decision-making under uncertainty. An MDP is described by a tuple 
$M = \langle S, A, T, R, \gamma, d_0 \rangle$. $S$ and $A$ are finite sets of states and actions, respectively. $S_t$ and $A_t$ denote the state and the action at time $t$. $T(s_t, a_t, s_{t+1})$ is the probability that $S_{t+1}{=}s_{t+1}$ given $A_t{=} a_t$ and $S_t{=}s_{t}$. $R(s_t, a_t)$ is the reward for taking $a_t$ at $s_t$.
$\gamma$ is the discount factor.
$d_0$ is the initial state distribution, i.e., $S_0 \sim d_0$.
%The absorbing terminal state always transitions back to itself with zero reward.

A solution to an MDP is called a \emph{policy} ($\pi$).
We use the following two types of policies in the paper. A \emph{stationary policy} is a conditional distribution of actions given a state.
A \emph{history-dependent policy} is a conditional distribution of actions given a history, where a history $h_{t+1}$ is a sequence of state-action pairs up to time $t$ and the last visited state $s_{t+1}$.
An optimal policy for an MDP is a policy that maximizes $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)|d_0, \pi]$.
A policy $\pi$ induces a value function $V^{\pi}(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)|S_0=s,\pi]$.
The optimal value function $V^{*}$ is the value function corresponding to an optimal policy.
%For a particular state, a value function $V^{\pi}_{H}$ represents the expected return given a policy $\pi$ up to time step $H$. 
%When $H$ is finite, we call it a value function for a finite horizon.

\subsection{Stochastic Shortest Path Problems}
A \emph{stochastic shortest path problem} (SSP) is an undiscounted, cost-based counterpart of an MDP. An SSP is represented by
 a tuple $\langle S, A, T, C, d_0, G \rangle$, where
$S$, $A$, and $T$ are the same as in an MDP, and $C: S \times A \rightarrow \mathbb{R}_{+}$ gives the cost $C(s_t, a_t)$ of performing $a_t$ at $s_t$.
$d_0$ is the initial state distribution, and $G \subset S$ is a set of goal states.
The goal states are absorbing, and transitions out of goal states have zero cost.
%In this paper, we only consider finite sets of states and actions.

A solution of an SSP is a \emph{policy}. 
%A \emph{deterministic policy} $\pi$ maps a state $s$ to an action $a \in A$.
%A \emph{stochastic policy} $\pi$ maps a state $s$ to a probability distribution on $A$.
%A policy $\pi$ induces a value function $V^{\pi}(s) = \mathbb{E}[\sum_{t=0}^{\infty} C(S_t, A_t)| d_0, \pi]$, which represents the expected cost of reaching a goal state from $s$ by following $\pi$.
An \emph{optimal policy} $\pi^{*}$ is a policy that minimizes 
$\mathbb{E}[\sum_{t=0}^{\infty} C(S_t, A_t)| d_0, \pi]$.
We restrict our attention to problems in which there exists at least one \emph{proper policy}, i.e., a policy that reaches a goal state from every state with probability $1$, and in which every improper policy incurs infinite cost.
Under this assumption, an SSP is guaranteed to have an optimal policy that is proper \citep{bertsekasAnalysisStochasticShortest1991}.


\begin{figure*}[t]
    \centering
        \begin{subfigure}[b]{0.40\linewidth}
            \centering
            \includegraphics[height=1.6in]{imgs/baker302_fixed.drawio.png}
            \caption{Environment}
            \label{img::baker5_403}
        \end{subfigure}%
        \begin{subfigure}[b]{0.40\linewidth}
            \centering
            \includegraphics[height=1.6in]{imgs/belief_changes.png}
            \caption{Observer's belief ($\beta=0.3$)}
            \label{img::belief_changes_baker5_403}
        \end{subfigure}
        %\vspace{-8pt}
        \caption{MazeWorld Domain}
        \label{img::mazeworld}
\end{figure*}

\subsection{Observer-Aware MDPs}
Observer-Aware Markov Decision Processes (OAMDPs) extend MDPs by allowing the reward to depend on the observer's assumed belief over the types of the observed agent \citep{miuraUnifyingFrameworkObserveraware2021}.

\begin{definition}
An OAMDP is a tuple\footnote{The original work~\citep{miuraUnifyingFrameworkObserveraware2021} allowed an arbitrary function from $H^{*}$ to $\Delta^{|\Theta|}$ to update the observer's belief. Here, we restrict our attention to a case where the observer updates its belief in a Bayesian fashion.}\\ 
\centerline{~~~$M = \langle S, A, T, \gamma, d_0, \allowbreak \Theta, b_0, \tau, R \rangle$ where:}
\begin{itemize}
    \item $S$, $A$, $T$, $\gamma$, and $d_0$ are the same as in MDPs.
        In this paper, we assume $S$ and $A$ are finite.
	\item  $\Theta$ is a (finite) set of \emph{types}, representing characteristics of the agent, such as possible goals, intentions, or capabilities.
    The types in OAMDPs are analogous to the types in Bayesian game theory~\citep{harsanyiGamesIncompleteInformation1968}.
        \item $b_0 \in \Delta^{|\Theta|}$ is the initial belief of the observer over the types, where
        $\Delta^{|\Theta|}$ is the probability simplex over $\Theta$.
%	\item $B: H^{*} \rightarrow \Delta^{|\Theta|}$ represents the assumed belief of the observer given a history.
%	$H^{*}$ is the set of all finite histories and $\Delta^{|\Theta|}$ is a simplex on $\Theta$.
        \item $\tau: S \times A \times S \times \Theta \rightarrow [0, 1]$ gives the probability, written $\tau(a, s'|s, \theta)$, of the observer witnessing a transition $\langle s, a, s'\rangle$ given $s$ and $\theta$.
        $\tau$ can represent different policies and transition functions of the observed agent depending on types.
	\item $R : S \times A \times \Delta^{|\Theta|} \rightarrow \mathbb{R}$ is a belief-dependent reward function. 
    In this paper, we assume that the rewards can be represented as a linear combination of \emph{domain} and \emph{belief-dependent} rewards.
    That is, $R(s, a, b) {=} w_d R_d(s, a) + w_b R_b(b)$ for $w_d, w_b {\in} \mathbb{R}_{+}$, where $R_d$ and $R_b$ represent the domain and belief-dependent rewards, respectively. 
\end{itemize}

After observing a transition $\langle s, a, s' \rangle$,  the observer is assumed to be Bayesian rational and updates its belief ($b_t$) using Bayes' rule:
\begin{equation}
b_{t+1}^{s,a,s'}(\theta) = \frac{ \tau(a, s'|s, \theta) \cdot b_t(\theta) }{ \sum_{\theta' \in \Theta} \tau(a, s'|s, \theta')  \cdot b_t(\theta') }.
\label{eq::belief_update}
\end{equation}

A solution to an OAMDP is a policy that maximizes the expected discounted return:
\begin{equation}
\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(S_t,A_t,B_t)|d_0, \pi]. 
\end{equation}
%For an OAMDP $M = \langle S, A, T, \gamma, d_0, \allowbreak \Theta, b_0, \tau, R, \rangle$, we define the corresponding \emph{domain MDP} as
%$M_d = \langle S, A, T, \gamma, d_0, R_d \rangle$.
\end{definition}

Figure~\ref{img::mazeworld} shows an example of an OAMDP with $\Theta= \{\theta_A, \theta_B, \theta_C, \theta_D, \theta_E \}$, where each type corresponds to the observed agent's goal.
$\tau(a, s'|s, \theta)$ is typically set to  
$T^j(s, a, s') \pi_{\theta}(s, a)$, where 
$\pi_{\theta}$ is an assumed policy of the observed agent given a type $\theta$, and
$T^j$ is a transition function according to the observer.
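
As a purely illustrative instance of Equation~\ref{eq::belief_update} (the numbers are hypothetical and are not taken from the MazeWorld domain), suppose $\Theta = \{\theta_1, \theta_2\}$, $b_t = (0.5, 0.5)$, and the observed transition satisfies $\tau(a, s'|s, \theta_1) = 0.8$ and $\tau(a, s'|s, \theta_2) = 0.2$. Then
\[
b_{t+1}(\theta_1) = \frac{0.8 \cdot 0.5}{0.8 \cdot 0.5 + 0.2 \cdot 0.5} = 0.8,
\qquad
b_{t+1}(\theta_2) = \frac{0.2 \cdot 0.5}{0.8 \cdot 0.5 + 0.2 \cdot 0.5} = 0.2,
\]
so a transition that is four times more likely under $\theta_1$ than under $\theta_2$ shifts a uniform belief to $(0.8, 0.2)$.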

Note that since the observer's belief is not directly accessible to the acting agent, the observer's belief in OAMDPs should be understood as a second-order belief. That is, it is a belief the acting agent believes the observer to have.
%For example, $\pi_{\theta_A}$ represents a policy given the observed agent is going to the goal $A$.
%When $\Theta$ represents different capabilities of the observed agent,
%$T_{\theta}$ represents transition functions corresponding to different capabilities.
%When $T_{\theta}$ is the same for all $\theta \in \Theta$, 
%$\tau(a, s'|s, \theta)$ simplifies to $\pi_{\theta}(s, a)$ in Equation~\ref{eq::belief_update}.

\ignore{
\begin{figure}[h]
    \centering
        \begin{subfigure}[b]{0.6\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/mcts_baker5_403.png}
            \caption{Environment}
            \label{img::baker5_403}
        \end{subfigure}
        \begin{subfigure}[b]{0.6\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/mcts_belief_changes_baker_403.png}
            \caption{Observer's belief ($\beta=0.3$)}
            \label{img::belief_changes_baker5_403}
        \end{subfigure}
        \caption{MazeWorld}
        \label{img::mazeworld}
\end{figure}
}

\mypara{Observer's Belief in Approximate Rationality}
One possible choice for $\pi_{\theta}$ is to assume that the acting agent takes an approximately optimal action at each state given their goals, desires, and intentions:
\begin{equation}
\pi_{\theta}(s, a) \propto \exp\big(\beta Q_{\theta}^{*}(s, a)\big),
\label{eq::noisy_rational}
\end{equation}
where $Q_{\theta}^{*}$ is the optimal Q-value
$Q_{\theta}^{*}(s,a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R_{\theta}(S_t, A_t)|S_0{=}s,A_0{=}a,\pi^{*}, \theta]$,
representing how good $a$ is given $s$ and $\theta$. 
Note that $Q_{\theta}^{*}$ is computed with respect to $M_{\theta}=\langle S, A, T_{\theta}, R_{\theta}, \gamma, d_0 \rangle$ defined for each $\theta \in \Theta$. 
$\beta \in \mathbb{R}$ serves as a hyperparameter representing the agent's rationality level. Intuitively, the observed agent is assumed to select an action with probability proportional to the exponential of the action's quality.
Figure~\ref{img::belief_changes_baker5_403} shows how the observer's belief changes according to Equations~\ref{eq::belief_update} and~\ref{eq::noisy_rational}.
All the goals are equally likely initially. As the agent moves out of the first room, the beliefs in the goals $B$ and $D$ decrease. By the time the agent enters the top right room, the goals $A$ and $E$ are the two most likely goals.


The Bayesian update (Equation~\ref{eq::belief_update}) using
the Boltzmann action model (Equation~\ref{eq::noisy_rational}) is based on the idea that
people often infer goals, desires, and intentions from others' behaviors by assuming that their behaviors are approximately rational given their goals, desires, and intentions \citep{dennettIntentionalStance1987}.
\citet{bakerActionUnderstandingInverse2009} showed that the Bayesian update using Equation~\ref{eq::noisy_rational} largely agrees with human understanding of goals. 

Note that the definition of OAMDPs is not restricted to using the Boltzmann action model as $\pi_{\theta}$. Other possibilities include assuming that the observed agent follows maximum entropy policies~\citep{ziebartMaximumEntropyInverse2008} or boundedly rational policies \citep{zhi-xuanOnlineBayesianGoal2020} given its type.

\mypara{Belief-Dependent Rewards}
OAMDPs can produce various observer-aware behaviors by changing $R_b$.
For instance, to clarify intentions, $R_b$ might be defined as the negative total variation (TV) distance or the negative Euclidean distance between the current and target beliefs, where the target belief satisfies $b(\theta^{*})=1$ for the intended type $\theta^{*} \in \Theta$.
On the other hand, if the observed agent wants to obscure its intention, the reward could be the entropy of the observer's belief.
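One possible way to write these choices is sketched below; the normalization constants and the superscript names are illustrative assumptions rather than definitions used elsewhere in the paper, and $b^{*}$ denotes the target belief:
\begin{gather*}
R_b^{\mathrm{TV}}(b) = -\tfrac{1}{2} \sum_{\theta \in \Theta} |b(\theta) - b^{*}(\theta)|, \\
R_b^{\mathrm{Euc}}(b) = -\|b - b^{*}\|_2, \\
R_b^{\mathrm{Ent}}(b) = -\sum_{\theta \in \Theta} b(\theta) \log b(\theta),
\end{gather*}
with the convention $0 \log 0 = 0$. The first two are maximized when $b = b^{*}$, whereas the entropy is maximized by the uniform belief.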


\mypara{Relationship to POMDPs and I-POMDPs}
While both OAMDPs and partially observable Markov decision processes (POMDPs) operate on the belief of an agent, neither model subsumes the other. The belief in an OAMDP is the second-order belief of the acting agent about the belief of the observer about the type of the acting agent. 
On the other hand, the belief in a POMDP is the first-order belief of the acting agent about the state of the world.
Similarly to how beliefs over states are sufficient for optimal control in POMDPs, we will next show that the current state and the belief of the observer are sufficient statistics in OAMDPs.
However, while
most solution methods for POMDPs~\citep{monahanSurveyPartiallyObservable1982,pineauPointbasedValueIteration2003a} rely on 
the piecewise linearity and convexity (PWLC) of the value function,
the value functions for OAMDPs are not necessarily PWLC.
For example, consider using the negative Euclidean distance to the target belief as $R_b$.
Then $R_b$ is not PWLC on $\Delta^{|\Theta|}$.
Therefore, solution methods for POMDPs are not directly applicable to OAMDPs.

OAMDPs can be seen as a restricted subset of Interactive POMDPs (I-POMDPs) \citep{gmytrasiewiczFrameworkSequentialPlanning2005}, multi-agent extensions of POMDPs in which agents act by recursively modeling the other agents' beliefs.
Several previous works have used I-POMDPs and related multi-agent models to produce observer-aware behaviors~\citep{loPlanningPartnerUncertainty2020,alonDisInformationTheory2023}.
Multi-agent formulations are arguably more general and let us reason about what others do in response based on their beliefs.
However, multi-agent formulations are also notoriously hard to solve~\citep{seukenFormalModelsAlgorithms2008}.
\citet{miuraUnifyingFrameworkObserveraware2021} showed that OAMDPs can be seen as a subset of I-POMDPs, where (1) the observer is completely passive, (2) the acting agent knows the observer's type, and (3) the environment is observable to both agents.  
In the following sections, we will see how our proposed solution methods make use of the additional assumptions.


\mypara{Complexity of OAMDPs}
Although OAMDPs make more restrictive assumptions than I-POMDPs,
computing an optimal policy has been shown to be PSPACE-hard \citep{miuraUnifyingFrameworkObserveraware2021}.
This result suggests that solving OAMDPs is intractable in the worst case.
The reduction used in the proof relies on OAMDPs with discontinuous rewards.
In this paper, we develop approximation algorithms with provable bounds for OAMDPs with Lipschitz-continuous reward and belief transitions.

\subsection{OASSPs}
In this paper, we also consider
OASSPs \citep{lepersHowExhibitMore2024a},
an undiscounted and cost-based version of OAMDPs.
An OASSP is a tuple $\langle S, A, T, d_0, \Theta, b_0, \tau, C, G\rangle$ where $C: S \times A \times \Delta^{|\Theta|} \rightarrow \mathbb{R}_{+}$ is a belief-dependent cost function, and $G$ is a set of goal states.
The other components are the same as in OAMDPs.
An optimal policy for an OASSP is a policy that minimizes: 
\begin{equation}
\mathbb{E}[\sum_{t=0}^{\infty} C(S_t, A_t, B_t)|d_0, \pi].
\end{equation}
As in OAMDPs, we assume that $C$ is a linear combination of the domain cost ($C_d$) and the belief-dependent cost ($C_b$). That is, $C(s, a, b) = w_d C_d(s, a) + w_b C_b(b)$ for $w_d,w_b \in \mathbb{R}_{+}$.
A domain SSP corresponding to an OASSP is an SSP defined as $M_d=\langle S, A, T, d_0, C_d, G\rangle$. 

\section{Properties of OAMDPs}
In this section, we discuss properties of OAMDPs necessary for developing the proposed algorithms.
\subsection{Sufficient Statistics}
To compute policies for OAMDPs,
previous work \citep{miuraUnifyingFrameworkObserveraware2021} used a general-purpose method 
%such as AO$^{*}$ \cite{nilsonnilsPrinciplesArtificialIntelligence1980} 
such as UCT \citep{kocsisBanditBasedMonteCarlo2006} to
compute history-dependent policies.
%that do not exploit the structure of OAMDPs.
However,  
we show that
the current state and the belief of the observer contain sufficient information to choose the best action to take:
\begin{proposition}
The current state and the belief of the observer are sufficient statistics in OAMDPs.
\label{prop::sufficient_statistics}
\end{proposition}
\begin{proof}
For all $s_t,s_{t+1} \in S$, $a_t \in A$, $b_t \in \Delta^{|\Theta|}$, $h_t \in H_t$:
\begin{align}
    &\Pr(s_{t+1}, b_{t+1}|s_t, a_t, b_t, h_t) \\ 
    &= \Pr(b_{t+1}|s_t, a_t, s_{t+1}, b_t, h_t)\Pr(s_{t+1}|s_t, a_t, b_t, h_t) \\ 
    &= [b_{t+1} = b_t^{s_t, a_t, s_{t+1}}] T(s_t, a_t, s_{t+1}) \text{ by definition}\\
    &= \Pr(s_{t+1}, b_{t+1}|s_t, a_t, b_t) 
\end{align}
where $[\cdot]$ is the Iverson bracket. Moreover, $R$ only depends on $S_t$, $A_t$, and $B_t$ by definition.
\end{proof}

With Proposition~\ref{prop::sufficient_statistics} in place, we can look for policies of the form $\pi: S \times \Delta^{|\Theta|} \times A \rightarrow [0, 1]$.
In other words, we can look for policies for the \emph{belief MDP}, whose state space is $S \times \Delta^{|\Theta|}$ instead of $S$. 
%$M_b = \langle S \times \Delta^{|\Theta|}, A, T_b, R, \gamma, d_0^b \rangle$ where:
%\begin{align}
%    T_b(\langle s, b \rangle, a, \langle s', b' \rangle)
%    &= [b'=b^{s,a,s'}]T(s, a, s'),\\
%    d_0^b(\langle s, b \rangle)
%    &=[b=b_0]d_0(s).
%\end{align}
%\begin{itemize}
%    \item \[T_b(\langle s, b \rangle, a, \langle s', b' \rangle) = \begin{cases}
%        0 & b' \neq b^{s, a, s'} \\
%        T(s, a, s') & b' = b^{s, a, s'} 
%    \end{cases}\]
%    
%    \item \[d_0^b(\langle s, b \rangle) = \begin{cases}
%        0 & b \neq b_0 \\
%        d_0(s) & b = b_0
%    \end{cases}\]
%\end{itemize}
%corresponding to the original OAMDP, where the set of states is $S \times \Delta^{|\Theta|}$ instead of $S$.
Note that, while the original OAMDP has a finite number of states, the belief MDP has a continuous state space.
%Proposition~\ref{prop::sufficient_statistics}
% is
%analogous to how beliefs over states (belief states) are sufficient for optimal control for POMDPs~\citep{kaelblingPlanningActingPartially1998}.
%
\subsection{Discontinuity in Value Functions}
Before delving into our proposed algorithms, we address a potential issue in developing a value-based approximation algorithm for OAMDPs.
Both of our proposed algorithms approximate values by grouping similar beliefs.
This approach operates under the implicit assumption that nearby beliefs should yield similar values.
However, we demonstrate that, in a general OAMDP, the rate at which the observer's belief changes can be unbounded, thus invalidating this assumption. To illustrate this issue, consider the following example:
%But then, $f: b \mapsto b^{s,a,s'}$ is no standard Lipschitz function.

\begin{example}
Let us assume that we have an OAMDP with:
\begin{itemize}
\item $\Theta=\{\theta_1, \theta_2, \theta_3\}$,
\item $b_1=(1-\epsilon,\epsilon, 0) \in \Delta^3$ for some $\epsilon \in (0, 1)$,
\item $b_2=(1-\epsilon, 0, \epsilon) \in \Delta^3$, and
\item $\tau^{s,a,s'} = (\tau_1=0, \tau_2 >0, \tau_3>0)$.
\end{itemize}
Then, $b_1^{s,a,s'} = (0,1,0)$ and  $b_2^{s,a,s'}  = (0,0,1)$. Thus,
\begin{align}
\frac{\|b_1^{s, a, s'}-b_2^{s, a, s'}\|_{\infty}}{\|b_1-b_2\|_{\infty}}
& = \frac{\|(0,1,-1)\|_{\infty}}{\|(0,\epsilon,-\epsilon)\|_{\infty}} = \frac{1}{\epsilon}.
\end{align}
This ratio diverges as $\epsilon \rightarrow 0$, so no finite Lipschitz constant bounds the belief update.
\end{example}


\subsection{Lipschitz OAMDPs}
Given the potential discontinuity in values,
we discuss special cases of OAMDPs with Lipschitz-continuous reward and belief transitions.
\begin{definition}
An OAMDP is $(L_r, L_p)$-Lipschitz if for all $s, s' \in S$, $a \in A$, and $b_1, b_2 \in \Delta^{|\Theta|}$:
\begin{align}
|R(s,a,b_1) - R(s, a, b_2)| \leq L_r \|b_1 - b_2\|_{\infty}, \\
\|b_1^{s, a, s'} - b_2^{s, a, s'}\|_{\infty}
\leq L_p\|b_1 - b_2\|_{\infty}.
\end{align}
\end{definition}
Intuitively, in Lipschitz OAMDPs, beliefs close to each other have similar rewards and update to close beliefs.
The definition is analogous to Lipschitz continuity of continuous MDPs in general \citep{rachelsonLocalityActionDomination2010}.

Lipschitz continuity of reward and belief transitions can be related to
Lipschitz continuity of the value function under a favorable assumption:
\begin{restatable}{proposition}{proplipschitz}
\label{prop::lipschitz}
For an $(L_r, L_p)$-Lipschitz OAMDP,
if $\gamma L_p < 1$,
then $V^{*}$ is $L_{V^{*}}$-Lipschitz continuous where: \begin{equation}
\label{eq::L_V}
L_{V^{*}} = \frac{L_r}{1 - \gamma L_p}.
\end{equation}
\end{restatable}

\begin{proof}
See Appendix~\ref{sec:proofs}.
\end{proof}
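As a purely numerical illustration of Equation~\ref{eq::L_V}, with hypothetical constants $L_r = 1$, $L_p = 1.05$, and $\gamma = 0.9$, we have $\gamma L_p = 0.945 < 1$ and
\[
L_{V^{*}} = \frac{1}{1 - 0.945} = \frac{1}{0.055} \approx 18.2.
\]
With $L_p = 1.2$ instead, $\gamma L_p = 1.08 \geq 1$ and Proposition~\ref{prop::lipschitz} does not apply, which shows how quickly the bound degrades as $\gamma L_p$ approaches $1$.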
As we will see later, Lipschitz continuity of $V^{*}$ enables us to provide an error bound for the discretization (Proposition~\ref{prop::grid_vi_error_bound}).
Note that Proposition~\ref{prop::lipschitz} states a sufficient condition for Lipschitz continuity of $V^{*}$. 
In other words, there could be cases where the conditions of Proposition~\ref{prop::lipschitz} are not met, but $V^{*}$ is still Lipschitz.
Moreover, in OAMDPs, belief transitions are assumed to be the Bayesian update using Equation~\ref{eq::belief_update}.
We can establish a relationship between the Lipschitz continuity of belief transitions and $\tau$ as follows:
\begin{restatable}{proposition}{proptau}
\label{prop::tau}
If
$\tau^{s, a, s'}(\theta) > 0 $ for all $\theta \in \Theta$, $s, s' \in S$, and $a \in A$, 
then belief transitions are Lipschitz continuous.
\end{restatable}
\begin{proof}
See Appendix~\ref{sec:proofs}.
\end{proof}
For example, using the Boltzmann action model (Equation~\ref{eq::noisy_rational}) ensures that  $\tau^{s, a, s'}(\theta) > 0 $, which guarantees the Lipschitz continuity of belief transitions.

\section{Approximation Algorithms}
In this section, we propose approximation algorithms for OAMDPs and OASSPs.
Our first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the observer's belief into regular grids. Our second proposed algorithm is a variant of Real-Time Dynamic Programming (RTDP), called Grid-RTDP.
Grid-RTDP relies on the same grid-based discretization scheme as Grid-VI, but focuses its updates on promising states using heuristic estimates. 

\subsection{Grid-Based Value Iteration for OAMDPs/SSPs}
% \item The value function is not necessarily linear (unlike POMDP).

We first describe a grid-based value iteration algorithm for OAMDPs and OASSPs.
Grid-VI uses a set of regular grid points to approximate value functions.
A regular grid with resolution $K$ is defined as:
%Let $K$ be a positive integer representing the resolution of the grid. The regular grid is defined as:
\begin{equation}
P_K = \Big\{ b = \frac{1}{K} k \;\Big|\; k \in I^{|\Theta|}_+,\ \sum^{|\Theta|}_{i=1} k(i) = K\Big\},
\end{equation}
where $I^{|\Theta|}_+$ is the set of $|\Theta|$-vectors of non-negative integers.
$P_K$ divides $\Delta^{|\Theta|}$ into a set of equal-size sub-simplices.
Figure~\ref{img::triangulation} shows a sample %an example of a 
regular grid on $\Delta^3$ with $K=2$.
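For instance, for $|\Theta| = 3$ and $K = 2$, enumerating the non-negative integer vectors $k$ with $k(1) + k(2) + k(3) = 2$ and dividing by $K$ gives
\[
P_2 = \big\{ (1,0,0),\ (0,1,0),\ (0,0,1),\ (\tfrac{1}{2},\tfrac{1}{2},0),\ (\tfrac{1}{2},0,\tfrac{1}{2}),\ (0,\tfrac{1}{2},\tfrac{1}{2}) \big\},
\]
that is, the three vertices of the simplex and the three edge midpoints. The ordering used here is arbitrary and is not meant to match any labeling in the figure.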

As previously done in grid-based approximation algorithms for POMDPs~\citep{lovejoyComputationallyFeasibleBounds1991},
the value at a given belief point $b \in \Delta^{|\Theta|}$ is interpolated
using the barycentric coordinates of $b$ with respect to $P_K(b)$:
%a convex combination of values at $P_K(b)$:
\begin{equation}
    V_{K}(s, b) = \sum_{b_i \in P_K(b)} \lambda_i V_K(s, b_i),
    \label{eq::convex_interpolation}
\end{equation}
where $P_K(b)$ is the set of vertices of the sub-simplex containing $b$, $\lambda_i \geq 0$, $\sum_{i=1}^{|\Theta|} \lambda_i = 1$, and $b=\sum_{i=1}^{|\Theta|} \lambda_i b_i$.
In Figure~\ref{img::triangulation}, the value at $b$ is interpolated using the values at $g_4$, $g_5$, and $g_6$.
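As a worked illustration of Equation~\ref{eq::convex_interpolation} with hypothetical numbers, take $|\Theta| = 3$, $K = 2$, and $b = (0.6, 0.3, 0.1)$, and assume that the sub-simplex containing $b$ has the vertices $b_1 = (1, 0, 0)$, $b_2 = (\tfrac{1}{2}, \tfrac{1}{2}, 0)$, and $b_3 = (\tfrac{1}{2}, 0, \tfrac{1}{2})$. Solving $b = \sum_{i} \lambda_i b_i$ coordinate by coordinate gives $\tfrac{1}{2}\lambda_2 = 0.3$ and $\tfrac{1}{2}\lambda_3 = 0.1$, hence $\lambda = (0.2, 0.6, 0.2)$ and
\[
V_K(s, b) = 0.2\, V_K(s, b_1) + 0.6\, V_K(s, b_2) + 0.2\, V_K(s, b_3).
\]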
For each iteration,
the algorithm updates values at all $s \in S$ and $b \in P_K$ 
using the Bellman optimality operator ($\mathcal{T}$):
\begin{small}
%\begin{equation}
%\label{eq::bellman}
%(\mathcal{T}V_K)(s, b) =
%\max_{a \in A} \big[R(s, a, b) + \gamma \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, %a, s'})\big],
%\end{equation}
\begin{multline}
\label{eq::bellman}
(\mathcal{T}V_K)(s, b) = \\
\max_{a \in A} \big[R(s, a, b) + \gamma \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, a, s'})\big],
\end{multline}
\end{small}
where values at $b \not \in P_K$ are interpolated using Equation~\ref{eq::convex_interpolation}.
The final policy is obtained by one-step lookahead using values at given belief points: 
\begin{small}
%\begin{equation}
%\pi_{K}(s, b) = \argmax_{a \in A} \big[R(s, a, b) + \gamma \sum_{s' \in S} %T(s,a,s') V_{K}(s', b^{s, a, s'})\big].
%\label{eq::one_step_lookahead}
%\end{equation}
\begin{multline} \small
\pi_{K}(s, b) = \\
\argmax_{a \in A} \big[R(s, a, b) + \gamma \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, a, s'})\big].
\label{eq::one_step_lookahead}
\end{multline}
\end{small}
%The resulting policy is obtained as: 
%\begin{equation}
%\pi_K(s, b, a) = 
%\sum_{b_i \in P_K(b)} \lambda_i [a = \argmax_{a_i \in A} Q_K(s, b_i, a_i)],
%\end{equation}
%where $Q_K(s, b_i, a_i) = R(s, a_i, b_i) + \gamma \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, a, s'})$.
%That is, we take the optimal actions at the corners of sub-simplices proportional to the corresponding weights $\lambda_i$.

For problems with undiscounted objectives (OASSPs), 
Equation~\ref{eq::bellman} is replaced with minimizing costs without the discount factor.
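As a sketch of what this replacement looks like, written with the cost function $C$ of the OASSP (the treatment of goal states below is an assumption that follows the usual SSP convention):
\begin{small}
\begin{multline*}
(\mathcal{T}V_K)(s, b) = \\
\min_{a \in A} \big[C(s, a, b) + \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, a, s'})\big],
\end{multline*}
\end{small}
with $V_K(s, b) = 0$ for every goal state $s \in G$.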

\subsubsection*{Efficient Interpolation}
\begin{figure}[t!]
        \centering
        \includegraphics[width=\linewidth]{imgs/triangle.pdf}
        \caption{An example of discretized belief points $P_K$ (right) with $K=2$ and $|\Theta|=3$. The left panel shows the corresponding integer points ($P_K'$).}
        \label{img::triangulation}
\end{figure}
One key advantage of using a regular grid is that finding $\lambda$ is quite efficient.
To efficiently find the barycentric coordinates of $b\in \Delta^{|\Theta|}$ with respect to $P_K(b) \subset \Delta^{|\Theta|}$, we use a Freudenthal triangulation \citep{freudenthalSimplizialzerlegungenBeschrankterFlachheit1942}:
\begin{equation} \small
P_K' = \Big\{ q \in I^{|\Theta|}_+ \;\Big|\; K=q(1) \geq q(2) \geq \cdots \geq q({|\Theta|}) \Big\}.
\end{equation}
Note that $|P_K'|=|P_K|=\frac{(K + |\Theta| - 1)!}{K! (|\Theta|-1)!}$.
Due to the one-to-one correspondence between points in $P_K$ and $P_K'$, we can find the barycentric coordinates of
$b\in \Delta^{|\Theta|}$ from the barycentric coordinates of the corresponding point $x$ in the transformed space \citep{lovejoyComputationallyFeasibleBounds1991,zhouImprovedGridbasedApproximation2001} as follows:

\begin{enumerate}
    \item Given $b \in \Delta^{|\Theta|}$, let $x(i)=K \sum_{j=i}^{|\Theta|} b(\theta_j)$.
    For example, given $b=(0.4, 0.4, 0.2)$, we have $x = (2.0, 1.2, 0.4)$.
    \item Let $v(i)$ be the largest integer such that $v(i) \leq x(i)$.
    In our example, $v=(2, 1, 0)$.
    \item Let $d(i) = x(i) - v(i)$.
    In our example, $d=(0.0, 0.2, 0.4)$.
    \item Let $p$ be a permutation of $1, \ldots, |\Theta|$ such that $d(p(1)) \geq d(p(2)) \geq \cdots \geq d(p(|\Theta|))$.
    In our example, $p=(3, 2, 1)$.
    \item Identify the vertices ($v_1,v_2,\cdots,v_{|\Theta|}$) of the subsimplex in $P'_K$ containing $x$ as follows:
    \begin{align}
        v_1 &= v, \\
        v_{j+1}(i) &= \begin{cases}
                    v_j(i) + 1 & \text{ if $i=p(j)$},\\
                    v_j(i) & \text{ otherwise}.
               \end{cases}
    \end{align}
    In our example, $v_1=q_4=(2, 1, 0)$, $v_2=q_5=(2, 1, 1)$, and $v_3=q_6=(2, 2, 1)$.
    Because of the one-to-one correspondence between $P_K$ and $P_K'$, this identifies the corresponding points $b_1,\cdots,b_{|\Theta|}$ in $P_K$, which are the corners of the sub-simplex containing $b$.
    In our example, $b_1=g_4=(0.5, 0.5, 0.0)$, $b_2=g_5=(0.5, 0.0, 0.5)$, and $b_3=g_6=(0.0,0.5,0.5)$.
    \item The barycentric coordinates $\lambda_1,\cdots,\lambda_{|\Theta|}$ are determined as:
    \begin{align}
        \lambda_i &= d(p(i-1)) - d(p(i)) \text{ for $2 \leq i \leq |\Theta|$},\\
        \lambda_1 &= 1 - \sum_{i=2}^{|\Theta|} \lambda_i.
    \end{align}
    In our example, $\lambda_1=0.6$, $\lambda_2=0.2$, and $\lambda_3=0.2$.
\end{enumerate}
As discussed by \cite{zhouImprovedGridbasedApproximation2001}, finding a sub-simplex can be done in $\mathcal{O}(|\Theta| \log |\Theta|)$ time.
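As a quick check of the running example (a verification added for illustration, using the inverse map $b(i) = \frac{1}{K}\big(q(i) - q(i+1)\big)$ with $q(|\Theta|+1)=0$, which undoes step~1), the weights reproduce the original belief:
\begin{equation*}
\begin{split}
&0.6\,(0.5, 0.5, 0) + 0.2\,(0.5, 0, 0.5) + 0.2\,(0, 0.5, 0.5) \\
&\quad= (0.4, 0.4, 0.2) = b,
\end{split}
\end{equation*}
so Equation~\ref{eq::convex_interpolation} gives $V_K(s,b) = 0.6\,V_K(s,g_4) + 0.2\,V_K(s,g_5) + 0.2\,V_K(s,g_6)$.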

\subsubsection*{Theoretical Guarantees}
We now discuss the theoretical guarantees of Grid-VI.
Our first result shows that Grid-VI converges to the unique fixpoint in the discounted setting.
\begin{proposition}
\label{prop::grid_vi_convergence_mdp}
For an OAMDP,
Grid-VI converges to the unique fixpoint $V_K^{*}$.
\end{proposition}
\begin{proof}
The interpolation (Equation~\ref{eq::convex_interpolation}) can be understood as an operator on the value function.
Let $\mathcal{A}_K$ be the corresponding operator. Grid-VI can then be seen as repeatedly applying $\mathcal{T}_{K} = \mathcal{A}_K \circ \mathcal{T}$
to the value function.
Since $\mathcal{A}_K$ is a non-expansion and $\mathcal{T}$ is a contraction (both in the max norm),
$\mathcal{A}_K \circ \mathcal{T}$ is also a contraction, and Grid-VI converges to the unique fixpoint $V^{*}_K$ \citep{gordonStableFunctionApproximation1995}.
\end{proof}
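The contraction claim can be spelled out in one line. For any two value functions $U$ and $V$, and writing $\gamma$ for the discount factor,
\begin{equation*}
\begin{split}
\|\mathcal{A}_K \mathcal{T} U - \mathcal{A}_K \mathcal{T} V\|_{\infty}
&\leq \|\mathcal{T} U - \mathcal{T} V\|_{\infty} \\
&\leq \gamma \|U - V\|_{\infty},
\end{split}
\end{equation*}
where the first inequality holds because each interpolated value is a convex combination of grid values, and the second is the standard contraction property of the Bellman operator.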

The next result establishes the error bound for the approximate value function $V^{*}_K$ from the optimal value function $V^{*}$ in the discounted setting.
We first prove
Lemma~\ref{lemma::one_step}, which bounds the one-step error due to approximation.
%We know bound the error be
\begin{restatable}{lemma}{lemmaonestep}
\label{lemma::one_step}
For an OAMDP whose value function is Lipschitz continuous with constant $L_{V^{*}}$, the one-step approximation error using a regular grid with resolution $K$ is bounded as:
\begin{equation}
\|\mathcal{T}_K V^{*}- V^{*}\|_{\infty} \leq 
\frac{L_{V^{*}}}{K}.
\end{equation}
\end{restatable}
\begin{proof}
See Section~\ref{sec:proofs}.
\end{proof}

In the discounted case, the overall value approximation error can be bounded using the one-step approximation error.
\begin{restatable}{proposition}{properrorbound}
\label{prop::grid_vi_error_bound}
For an OAMDP whose value function is $L_{V^{*}}$-Lipschitz continuous,
 we have:
\begin{equation}
\|V^{*} - V_K^{*}\|_{\infty} \leq 
\frac{L_{V^{*}}}{(1-\gamma)K} .
\end{equation}
\end{restatable}

\begin{proof}
See Section~\ref{sec:proofs}.
\end{proof}
Note that the right-hand sides go to $0$ as $K \rightarrow \infty$.
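To see how the one-step bound yields the overall bound, the following is a sketch of the standard argument (the full proof is in Section~\ref{sec:proofs}; the sketch uses only that $V_K^{*} = \mathcal{T}_K V_K^{*}$ and that $\mathcal{T}_K$ is a $\gamma$-contraction):
\begin{small}
\begin{align*}
\|V^{*} - V_K^{*}\|_{\infty}
&\leq \|V^{*} - \mathcal{T}_K V^{*}\|_{\infty} \\
&\quad + \|\mathcal{T}_K V^{*} - \mathcal{T}_K V_K^{*}\|_{\infty} \\
&\leq \frac{L_{V^{*}}}{K} + \gamma \|V^{*} - V_K^{*}\|_{\infty},
\end{align*}
\end{small}
and rearranging gives the bound of Proposition~\ref{prop::grid_vi_error_bound}. For illustration, with the hypothetical values $\gamma = 0.95$ and $K = 8$, the bound evaluates to $L_{V^{*}} / (0.05 \cdot 8) = 2.5\, L_{V^{*}}$.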

Next, we show that under the standard assumptions, Grid-VI converges to the unique fixpoint when it is applied to undiscounted problems (OASSPs) as well. 
To prove the claim,
we first note that,
for an OASSP $M=\langle S, A, T, d_0, \Theta, b_0, \tau, C, G\rangle$,
Grid-VI for OASSPs implicitly defines an SSP $M_K = \langle S \times P_K, A, T_K, d_0^K, C_K, G_K \rangle$ where%
% no blank line here
\begin{small}
\begin{align}
    T_K(\langle s, b \rangle, a, \langle s', b_i \rangle) &= \begin{cases} 
        0 & b_i \not \in P_K(b^{s, a, s'}),  \\
        \lambda_i T(s, a, s') & b^{s, a, s'} = \sum_i \lambda_i b_i,
    \end{cases} \\
    d_0^K(\langle s, b_i\rangle) &= 
    \begin{cases}
        0 & b_i \not \in P_K(b_0), \\
        \lambda_i d_0(s) & b_0 = \sum_i \lambda_i b_i,
    \end{cases}\\
    C_K(s, a, b) &= C(s, a ,b),\\
    G_K &= G \times P_K.
\end{align}
\end{small}

The states in $M_K$ consist only of the corners of sub-simplices.
The transitions in $M_K$ are the same as in the original OASSP, except that, after the belief update, $b^{s, a, s'}$ is transitioned to one of the belief points $b_i \in P_K(b^{s, a, s'})$ surrounding it.
Note that, unlike belief MDPs corresponding to OAMDPs/SSPs, the number of belief states in $M_K$ is finite.
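As a small illustration with hypothetical numbers, suppose $K=2$, $|\Theta|=3$, $T(s,a,s') = 0.7$, and the updated belief is $b^{s,a,s'} = (0.4, 0.4, 0.2)$, whose barycentric coordinates with respect to $g_4$, $g_5$, and $g_6$ are $(0.6, 0.2, 0.2)$. Then
\begin{small}
\begin{align*}
T_K(\langle s, b \rangle, a, \langle s', g_4 \rangle) &= 0.6 \cdot 0.7 = 0.42,\\
T_K(\langle s, b \rangle, a, \langle s', g_5 \rangle) &= 0.2 \cdot 0.7 = 0.14,\\
T_K(\langle s, b \rangle, a, \langle s', g_6 \rangle) &= 0.2 \cdot 0.7 = 0.14,
\end{align*}
\end{small}
and $T_K$ assigns probability $0$ to every other grid point paired with $s'$. The three values sum to $T(s,a,s') = 0.7$.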


Since $M$, $M_d$, and $M_K$ all have the same dynamics in terms of domain state transitions, we have:
\begin{lemma}
    If $M_d$ has a proper policy, $M$ and $M_K$ also have at least one proper policy.
    \label{lemma::proper_policy}
\end{lemma}
\begin{proof}
Let $\pi_d$ be a proper policy for $M_d$.
Then $\pi(\langle s, b\rangle, a) = \pi_K(\langle s, b\rangle, a) = \pi_d(s, a)$ are proper policies for $M$ and $M_K$, respectively.
\end{proof}

The fact that $M_K$ has a proper policy whenever $M_d$ has one lets us prove the convergence of Grid-VI.
\begin{proposition}
\label{prop::grid_vi_convergence_ssp}
If $M_d$ has a proper policy,
Grid-VI for OASSPs converges to the unique fixpoint $V^{*}_K$.
\end{proposition}
\begin{proof}
From Lemma~\ref{lemma::proper_policy}, $M_K$ has a proper policy when $M_d$ has one.
Our definition of OASSPs only allows positive costs. Therefore, any improper policy trivially incurs infinite cost.
For an SSP with a finite number of states,
value iteration converges to the unique fixpoint as long as there is a proper policy and every improper policy incurs infinite cost~\citep{bertsekasAnalysisStochasticShortest1991}.
\end{proof}

\mypara{Relationships to Grid-based Approximation Algorithms for POMDPs}
Our algorithm shares similarities with grid-based approximations for POMDPs \citep{lovejoyComputationallyFeasibleBounds1991,brafmanHeuristicVariableGrid1997,hauskrechtValuefunctionApproximationsPartially2000,zhouImprovedGridbasedApproximation2001,bonetEpsilonOptimalGridBasedAlgorithm2002}. 
The main difference is that the belief is over $\Theta$ in OAMDPs/SSPs instead of over $S$ as in POMDPs.
Approximation using regular grids requires a number of points that is exponential in the dimension of the belief vectors.
%This exponential requirement has been a major hurdle in applying grid-based approximation algorithms to real-world POMDPs, where the number of underlying states can be very large.
However, in most scenarios, it is reasonable to assume that the number of possible types
($|\Theta|$) is much smaller than the number of states.
Thus, needing a number of grid points exponential in the dimension of the belief vectors is less of a constraint for OAMDPs/SSPs.
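To make the dependence on $|\Theta|$ concrete, the grid size $|P_K| = \binom{K + |\Theta| - 1}{|\Theta| - 1}$ can be evaluated for a few small cases (the particular values of $K$ and $|\Theta|$ below are chosen only for illustration):
\begin{center}
\begin{small}
\begin{tabular}{c|ccc}
$|\Theta|$ & $K=1$ & $K=4$ & $K=8$ \\
\hline
$2$ & $2$ & $5$ & $9$ \\
$3$ & $3$ & $15$ & $45$ \\
$5$ & $5$ & $70$ & $495$ \\
\end{tabular}
\end{small}
\end{center}
Since values are stored for every pair $(s, b)$, the total number of entries is $|S| \cdot |P_K|$.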
%Secondly, grid points are needed for each
%$s \in S$ in OAMDPs, as the beliefs do not contain information about the domain states. 
%This necessitates a minor modification to the existing algorithms. 
%This requires a slight modification to the existing algorithms. Note that we can alternatively use $(|S| + |\Theta|)$-dimension belief vectors where the beliefs on $S$ is either $1$ or $0$.

\mypara{Relationships to Grid-based Approximation Algorithms for Continuous MDPs}
Our Grid-VI for OAMDPs/SSPs is a special case of grid-based value iteration for 
continuous MDPs  \citep{chowOptimalOnewayMultigrid1991,munosVariableResolutionDiscretization2002} in general.
One main difference is that, in OAMDPs/SSPs, the continuous part of the state space ($\Delta^{|\Theta|}$) is guaranteed to be a simplex, which enables the efficient interpolation method.
Another difference is that, due to the structure of OAMDPs/SSPs, discretization preserves the existence of a proper policy (Lemma~\ref{lemma::proper_policy}), which helped us prove the convergence of Grid-VI for the undiscounted setting.

\subsection{Grid-based Real-Time Dynamic Programming for OAMDPs/SSPs}
We now propose an extension of Real-Time Dynamic Programming (RTDP) \citep{bartoLearningActUsing1995a} to OAMDPs/SSPs, called Grid-RTDP.
A potential issue with Grid-VI is that it needs to update values at every state and every grid point.
However, many of these points could be irrelevant in computing an optimal policy.
RTDP is an asynchronous value iteration algorithm that can converge to the optimal solution without having to consider the entire state space.
RTDP avoids exploring a portion of the state space by utilizing an admissible heuristic, i.e., a lower bound on the expected cost to the goal.
While Grid-RTDP could be applied to OAMDPs,
our presentation in this section will be based on OASSPs.

%To adapt RTDP to solve OAMDPs, we use a simple discretization scheme ($d$) over beliefs over types:
%\begin{equation}
%d_K(b) = \left\lceil K \cdot b \right\rceil
%\end{equation}
%where $K$ is a positive integer representing the resolution of discretization, 
%and $\left\lceil \right\rceil$ is a ceiling function.
%For example,
%The discretization scheme is identical to the one used for adapting RTDP to POAMDPs, except that the beliefs are over $S$ in POMDPs but over $\Theta$ in OAMDPs.
%

%\mypara{RTDP for OASSPs}
Similar to Grid-VI,
Grid-RTDP discretizes beliefs into regular grids.
The value at a belief $b \in \Delta^{|\Theta|}$ is interpolated using Equation~\ref{eq::convex_interpolation}.
Algorithm~\ref{algorithm::rtdp} shows pseudocode for Grid-RTDP.
The algorithm consists of repeated trials, where each trial starts from the initial state and belief of the observer.
During each trial, the algorithm first randomly maps the current belief $b$ to one of the surrounding grid points $b_i \in P_K(b)$, where $b=\sum_{i=1}^{|\Theta|} \lambda_i b_i$.
Each $b_i$ is selected with probability $\lambda_i$ (line~\ref{line::discretize}).
Then the algorithm selects an action that minimizes the current cost estimate to the goal $Q_K(s, b_i, a)$ (line~\ref{line::greedy}):
\begin{small}
\begin{align}
&Q_K(s, b, a) \\
&= C(s, a, b) + \sum_{s' \in S} T(s, a, s') V_K(s', b^{s, a, s'})\\
&= C(s, a, b) + \sum_{s' \in S} T(s, a, s') \sum_{b_i \in P_K(b^{s, a, s'})} \lambda_i V_K(s', b_i),
\end{align}
\end{small}
where $V_K$ is initialized with a given heuristic function $h$.
%There could be many possible heuristic functions.
In this paper, we consider the following two heuristic functions:
\begin{itemize}
\item $h_{0}$: which always returns $0$ (i.e. no heuristics), and
\item $h_{d}$: which returns the scaled optimal cost to go for the underlying domain cost ($w_d \cdot V_d^{*}(s)$).
\end{itemize}
Note that both $h_0$ and $h_d$ are admissible heuristics.
After selecting the best action $a^{*}$, the cost estimate for the current state ($V_K(s, b_i)$) is updated to $Q_K(s, b_i, a^{*})$ (line~\ref{line::value_update}). Note that the values are updated only at beliefs in $P_K$.
The next state is then sampled according to the dynamics of the environment (line~\ref{line::next_state}) and the belief of the observer is updated accordingly (line~\ref{line::belief_update}).
%The resulting policy is obtained by one-step lookahead: 
%\begin{equation}
%\pi_K(s, b) = \argmin_{a \in A} Q_K(s, b, a).
%\end{equation}
\begin{algorithm}[t]
    \caption{Grid-RTDP}
    \begin{algorithmic}[1]
    \Function{Grid-RTDP}{}
        \While{within computational budget}
            \State \Call{Trial}{$d_0$, $b_0$}
        \EndWhile
    \EndFunction
    \\
    \Function{Trial}{$d_0$, $b_0$}
        \State $s \sim d_0$
        \State $b \gets b_0$
        \While{episode continues}
            \State \textcolor{red}{sample $b_i \in P_K(b)$ with the weight $\lambda_i$} \label{line::discretize}
            \State $a^{*} \gets \argmin_a Q_K(s, b_i, a)$ \label{line::greedy}
            \State $V_K(s, b_i) \gets Q_K(s, b_i, a^{*})$ \label{line::value_update}
            \State $s' \sim T(s, a^{*}, \cdot)$ \label{line::next_state}
            \State $b \gets b_i^{s, a^{*}, s'}$ \label{line::belief_update}
            \State $s \gets s'$
        \EndWhile
    \EndFunction
    \end{algorithmic}
    \label{algorithm::rtdp}
\end{algorithm}

\mypara{Relationships to RTDP-Bel}
Grid-RTDP is akin to RTDP-Bel \citep{bonetSolvingPOMDPsRTDPbel2009}, 
a version of RTDP developed for POMDPs.
Similar to Grid-RTDP,
RTDP-Bel is based on discretizing beliefs.
Let $d(b)$ be a discretization of $b$.
Unlike Grid-RTDP, which updates the value at $d(b)$ using Q-values at $d(b)$,
RTDP-Bel updates the value at $d(b)$ using Q-values at $b$.
This may cause RTDP-Bel to oscillate when two different belief points $b_1$ and $b_2$ discretize to the same point (i.e., $d(b_1) = d(b_2)$).
%%% This can be a problem when two different belief points $b_1$ and $b_2$ discretize to the same point ($d(b_1) = d(b_2))$, resulting in RTDP-Bel's oscillating behavior.
%However, RTDP-Bel does not have a convergence guarantee and may oscillate. This is because if two different belief points $b_1$ and $b_2$ are discretized to the same grid point, to update the value at  

\mypara{Properties}
We discuss some properties of Grid-RTDP.
When applied to SSPs, RTDP has the following guarantee:
%\begin{theorem}[\cite{bertsekasNeuroDynamicProgramming1996}]
%    If there exists a proper policy for an SSP $M$, every trial in RTDP for $M$ terminates in a finite number of steps.
%    \label{th::rtdp_finite_step}
%\end{theorem}

\begin{theorem}[\cite{bartoLearningActUsing1995a}]
    If there exists a proper policy for an SSP and the initial value is admissible, then RTDP converges to the optimal value at relevant states.
    \label{th::rtdp_convergence}
\end{theorem}

We now show that Grid-RTDP inherits a guarantee analogous to Theorem~\ref{th::rtdp_convergence} under the following conditions:
\begin{itemize}
    \item[A1] The domain SSP $M_d$ has a proper policy.
    \item[A2] The initial value estimates are admissible.
\end{itemize}


Combining Lemma~\ref{lemma::proper_policy} with Theorem~\ref{th::rtdp_convergence}, we get:
%\begin{proposition}
%    Under A1, every trial in RTDP for OAMDP terminates in a finite number of steps.
%    \label{prop::rtdp_finite_step}
%\end{proposition}

\begin{proposition}
    Under A1 and A2, Grid-RTDP converges to the optimal values ($V_K^{*}$) at relevant states.
    \label{prop::rtdp_convergence}
\end{proposition}
Note that Proposition~\ref{prop::rtdp_convergence} proves the convergence to the optimal value for discretized problems $V_K^{*}$, but not to $V^{*}$.

\subsubsection{Grid-Based Labeled RTDP for OASSPs}
We now propose labeled RTDP (LRTDP) \citep{bonetSolvingStochasticShortestPath2002} for OASSPs, called Grid-LRTDP.
The original
RTDP does not explicitly check for convergence, and can keep visiting states that are already solved, resulting in its slow convergence behavior.
LRTDP alleviates the issue by labeling these states as solved.
%Similar to RTDP, LRTDP for OAMDP uses a regular grid, and operates by mapping a belief $b$ to its neighbors $P_K(b)$.
The algorithm labels states as solved if residuals of Bellman updates in the states that could be visited under the current best policy are smaller than a given threshold.

Grid-LRTDP simply adds the LRTDP labeling procedure to Grid-RTDP.
Alternatively, Grid-LRTDP can be understood as applying LRTDP to a discretized problem $M_K$.
The pseudocode for the algorithm is given in Appendix~\ref{sec::lrtdp_pseudocode}.

The final policy for Grid-(L)RTDP is obtained as: 
\begin{equation}
\pi_K(s, b, a) = 
\sum_{b_i \in P_K(b)} \lambda_i [a = \argmin_{a_i \in A} Q_K(s, b_i, a_i)].
\end{equation}
This means we take the optimal actions at the corners of subsimplices, proportional to the corresponding weights $\lambda_i$. The reason we do not use the one-step lookahead policy (Equation~\ref{eq::one_step_lookahead}) is that it could lead us to regions of beliefs where the values have not yet converged. When an action is selected at $b$ such that it is not optimal at any $b_i \in P_K(b)$, the updated belief might not have been checked for convergence by Grid-LRTDP.
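For instance, with the hypothetical weights $\lambda = (0.6, 0.2, 0.2)$ over corners $b_1, b_2, b_3$, and assuming the greedy action is $a_1$ at $b_1$ and $b_2$ and $a_2$ at $b_3$ (both actions are placeholders for illustration), the policy at $(s, b)$ is
\begin{align*}
\pi_K(s, b, a_1) &= 0.6 + 0.2 = 0.8, \\
\pi_K(s, b, a_2) &= 0.2.
\end{align*}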

\begin{figure*}[t!]
        \centering
            \includegraphics[width=\linewidth]{imgs/legible_disimulation_blocks_world.pdf}
            \caption{Task-optimal behavior (top) and legible behavior (bottom) in BlocksWorld, from \citet{miuraUnifyingFrameworkObserveraware2021}. Red blocks are the ones the agent is holding. Blue blocks are blocks that were just put down. The possible goals are ``ARMS'' and ``RAMS''.}
            \label{img::blocks_world}
\end{figure*}

\begin{figure*}[t!]
        \centering
            \centering
            \includegraphics[width=\linewidth]{imgs/spelling_arms.pdf}
            \caption{Acronym}
            \label{img::acronym}
\end{figure*}

\section{Experiments}
We present experimental results solving OASSPs using the proposed algorithms.
In the first experiment, we compare Grid-VI and Grid-LRTDP on the time until values are $\epsilon$-consistent (the maximum residual is smaller than a given threshold).
In the second experiment, we compare Grid-(L)RTDP and UCT in terms of their anytime performances.
All code used in the experiments is available at \url{https://github.com/dosydon/approximation_algorithm_for_oamdp}.

\subsection{Domains}
We briefly describe the problem domains used in the experiments.
%\begin{figure}[t!]
%        \centering
%        \begin{subfigure}[b]{0.3\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/blocks_world.drawio.png}
%            \caption{BlocksWorld}
%            \label{img::blocks_world}
%        \end{subfigure}
%        \hspace{10pt}
%        \begin{subfigure}[b]{0.32\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/spelling.pdf}
%            \caption{Acronym}
%            \label{img::acronym}
%        \end{subfigure}
%        \caption{Problems}
%\end{figure}

\mypara{MazeWorld}
Figure~\ref{img::baker5_403} shows an example of MazeWorld.
%where the agent can take 9 different actions: \emph{Stay, North, South, East, West, NorthEast, NorthWest, SouthEast} and \emph{SouthWest}. 
The agent's goal is to reach one of the possible goals
$\{A,B, C, D, E\}$.
The domain costs are proportional to the distance traveled.
To encourage the agent to be clear about its intention, $C_b$ is the TV distance from the target belief.
To make the problem more challenging, the agent can be transported back to the initial state with probability $0.1$ in each state.
We set $w_d = 0.1$ and $w_b = 1.0$.

\mypara{BlocksWorld}
Figure~\ref{img::blocks_world} shows an example of BlocksWorld from \citet{miuraUnifyingFrameworkObserveraware2021}, where the goal is to stack blocks to spell ``ARMS''. Picking up a block always succeeds, while
putting down a block fails with probability $0.3$ (the block falls on the table).
Each domain action has a cost of $1$.
$C_b$ is the TV distance from the target belief.
We set $w_d = 0.1$ and $w_b = 1.0$.
The optimal policy first stacks ``R'' on top of ``S''. This is not optimal in terms of task progression, but it tells the observer that the goal ``ARMS'' is more likely than ``RAMS''.

%\mypara{MazeWorld with Explicit Communication}
%Figure shows an example of MazeWorld,
%where the agent can take 9 different actions: \emph{Stay, North, South, East, West, NorthEast, NorthWest, SouthEast} and \emph{SouthWest}. 
%The agent's goal is to reach either one of the possible goals
%$\{A,B, C\}$.
%The rewards for the actions are proportional to the negative distance traveled by the action.
%
\mypara{Acronym}
Figure~\ref{img::acronym} illustrates the Acronym domain.
There are four locations with letters.
The agent can move in eight different directions.
When the agent is at a location with a letter, it can toggle the letter in the cycle $A \rightarrow M \rightarrow R \rightarrow S \rightarrow A$.
The potential goals are to spell ``ARMS'', ``RAMS'', or ``MARS'' from top left to bottom right.
When toggling among letters,
there is a $0.3$ probability of accidentally toggling too far.
The objective is to spell ``ARMS'' while being ambiguous about the intention.
$C_b(b) = H_{max} - H(b)$ where $H_{max}$ is the entropy of the uniform distribution and $H(b)$ represents the entropy of $b$.
$w_d = 0.5$ and $w_b = 1.0$.
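As an illustration of this belief cost, assuming the natural logarithm (the base is not fixed above) and $|\Theta| = 3$, we have $H_{max} = \ln 3 \approx 1.10$, and
\begin{align*}
C_b\big((\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})\big) &= \ln 3 - \ln 3 = 0, \\
C_b\big((1, 0, 0)\big) &= \ln 3 - 0 \approx 1.10,
\end{align*}
so a maximally ambiguous belief incurs no belief cost, while a belief that fully reveals the goal incurs the largest one.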

\subsection{Computing $Q^{*}_{\theta}$}
Using the Boltzmann action model (Equation~\ref{eq::noisy_rational}) for the belief update (Equation~\ref{eq::belief_update}) requires computing $Q^{*}_{\theta}$ at each state for each $\theta \in \Theta$.
For Grid-VI, we used value iteration to compute $Q^{*}_{\theta}$ since Grid-VI needs to enumerate all states.
For Grid-LRTDP and UCT, 
we used LRTDP from $s \in S$ as needed to compute $Q^{*}_{\theta}(s, a)$ to avoid generating the entire state space.
The running times for each algorithm include the running times for computing $Q^{*}_{\theta}$.

\subsection{Offline Convergence}
We compare the following algorithms on the time until the maximum residual is smaller than $\epsilon=10^{-3}$:
\begin{itemize}
\item Grid-VI with $K=1, 4, 8$;
\item Grid-LRTDP with $K=1, 4, 8$ using $h_0$ and $h_d$.
\end{itemize}
Each run has a time limit of $10$ minutes and a memory limit of $8$~GB.

Table~\ref{tab:covergence} shows the results.
Grid-LRTDP using $h_d$ was overall the best algorithm, generating fewer belief states to solve problems.
The exception was the MazeWorld domain, where, due to the random transition back to the initial state, Grid-(L)RTDP had to generate most of the belief points.
%The benefit of using the informative heuristic varied across problems.
While some problems required only coarse discretization of beliefs, other problems required finer discretization to compute near optimal policies.

\begin{table*}[ht]
\begin{small}
\begin{tabular}{ |c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|  }
 \hline
 Domain & $|\Theta|$ & K & \multicolumn{4}{|c|}{Grid-VI} & \multicolumn{4}{|c|}{Grid-LRTDP($h_0$)} & \multicolumn{4}{|c|}{Grid-LRTDP($h_d$)}\\
  &  &  &V & t(s) & $|S|$ & $|P_K|$ & V & t(s) & $|S|$ & $|P_K|$ & V & t(s) & $|S|$ & $|P_K|$ \\
 \hline
 \multirow{3}{*}{MazeWorld}   & \multirow{3}{*}{$5$} &   1 & 18.90 & 18.07 & 148 & 740 & 19.14 & 3.79 & 148 & 739 & 19.11 & 3.95 & 148 & 602 \\
    &   & 4 & - & - & - & - & 17.08 & 208.35 & 148 & 10164 & 16.98 & 259.28 & 148 & 9711 \\
    &   & 8 & - & - & - &-&-&-&-&-& -&-&-&-\\
    %23168b0c0eef5774e1777450e0367a443c6f5d9c
 \hline
 \multirow{3}{*}{Acronym}   & \multirow{3}{*}{$3$}  
          & 1 & 15.25 & 15.25 &  6379 & 19137 & 15.76 & 9.93 & 6379 & 19137 & 15.72 & 9.95 &  6379 & 19137 \\
    &  & 4 & 7.89 & 346.90 & 6379 & 95685 & 8.63 & 38.91 &6379 & 88765 &8.70 & 31.81 & 6379 & 81728 \\
    &  & 8 & - & - & - & - & 8.43 & 221.75 & 6379 & 286953 & 8.48 & 165.40 & 6379 & 265752\\
    %23168b0c0eef5774e1777450e0367a443c6f5d9c
 \hline
 \multirow{3}{*}{BlocksWorld}   & \multirow{3}{*}{$2$}  
          & 1 & 3.67 & 3.98 & 125 & 250 & 3.56 & 3.34 & 125 & 250 & 3.58 & 1.72 & 125 & 134 \\
    &     & 4 & 3.13 & 10.02 & 125 &625 & 3.04 & 6.99 & 125 & 543 & 3.03 & 5.24 & 124 & 392 \\
    &    & 8 & 3.14 & 17.36 & 125 & 1125 & 3.04 & 12.0 & 125 & 890 & 3.04 & 8.85 & 123 & 625 \\
    %aec8053acd6a91fd5ec3f05071a4bff2debff725
 \hline
\end{tabular}
\end{small}
\caption{Time until convergence for different algorithms. $V$ represents the value when the policy is evaluated under the true environment ($M$). $t(s)$ is the running time in seconds. $|S|$ and $|P_K|$ represent the number of generated domain and belief states, respectively.}
\vspace{12pt} %% added space can be removed... just to balance end of page
\label{tab:covergence}
\end{table*}



\subsection{Anytime Performance}
We compare the following algorithms in terms of their anytime behavior:
\begin{itemize}
\item Grid-(L)RTDP with $K=4, 8$ using $h_d$;
\item UCT where the rollout policy $\pi^{*}_d$ is an optimal policy for the domain SSP.
\end{itemize}

Each algorithm was run for $10^2$, $10^3$, $5 \cdot 10^3$, $10^4$, $5 \cdot 10^4$, $10^5$, $5 \cdot 10^5$, $10^6$ Grid-(L)RTDP/UCT trials.
For UCT, the specified number of trials is performed online at each time step.
For Grid-(L)RTDP, the trials are performed offline.
Since both UCT and Grid-LRTDP before convergence do not necessarily reach the goal state, both of the algorithms are evaluated on the average costs for the first $50$ time steps.
Each run has a time limit of $10$ minutes and a memory limit of $2$~GB.

Figure~\ref{img::anytime} shows the results.
UCT and Grid-(L)RTDP exhibited complementary performance. While UCT showed better anytime performance in Acronym, it took some time to achieve good performance in BlocksWorld, a small problem instance with $|\Theta|=2$.
Comparing Grid-(L)RTDP with different resolutions ($K$), coarser grids generally resulted in better anytime behavior as long as the resolution was sufficient.
Grid-RTDP and Grid-LRTDP exhibited comparable anytime behavior.

\begin{figure*}[t!]
        \centering
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/reset5_online_302.png}
            \caption{MazeWorld}
		\label{img::maze_online}
        \end{subfigure}
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/spelling_online_8.png}
            \caption{Acronym}
		\label{img::acronym_online}
        \end{subfigure}
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/blocks_online_1.png}
            \caption{BlocksWorld}
		\label{img::blocks_online}
        \end{subfigure}
        \caption{Anytime behaviors for different algorithms.}
        \vspace{8pt} %% added space can be removed... just to balance end of page
		\label{img::anytime}
\end{figure*}

%\begin{figure}[t!]
%    \centering
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/baker_101.png}
%            \caption{Grid VI ($K=10$)}
%		\label{img::trace}
%        \end{subfigure}
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/mcts_baker_101.png}
%            \caption{UCT with $5000$ iterations}
%		\label{img::trace}
%        \end{subfigure}
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/belief_changes_baker_101.png}
%            \caption{Belief Changes (Grid VI)}
%		\label{img::belief_change_without}
%        \end{subfigure}
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/mcts_belief_changes_baker_101.png}
%            \caption{Belief Changes (UCT)}
%		\label{img::belief_change_without}
%        \end{subfigure}
%	\caption{Example of combining implicit and explicit communication in the Maze World environment ($\beta = 0.3$).}
%	\label{img::grid_world}
%\end{figure}

\section{Related Work}
OAMDP is a framework that unifies different kinds of observer-aware behaviors.
Observer-aware behaviors include
\emph{legible} behaviors \citep{draganGeneratingLegibleMotion2013,miuraMaximizingLegibilityStochastic2021}, which implicitly convey intentions via the choice of actions.
Similarly, \emph{explicable} behaviors \citep{zhangPlanExplicabilityPredictability2017,gongExplicablePolicySearch2022} conform to observers' expectations.
\emph{Deceptive} behaviors \citep{draganDeceptiveRobotMotion2015,mastersDeceptivePathPlanning2017,savasDeceptiveDecisionmakingUncertainty2022} hide agents' intentions or actively deceive observers.
\emph{Predictable} behaviors enable observers to predict future actions \citep{fisacGeneratingPlansThat2020,lepersHowExhibitMore2024a}.
Agents can also express their \emph{(in)capability} via the choice of their actions \citep{kwonExpressingRobotIncapability2018}.
OAMDP can also model the combination of implicit communication through behaviors and explicit communication with messages \citep{miuraObserverAwarePlanningImplicit2024}.
%While there have been several attempts to combine different kinds of observer-aware behaviors~\cite{draganGeneratingLegibleMotion2013,strouseLearningShareHide2018,chakrabortiBalancingExplicabilityExplanations2019,kulkarniUnifiedFrameworkPlanning2019}, there is no unifying framework that reveals the relationships among the approaches and the complexity of the problem.
%

OAMDPs can be regarded as a special case of decision processes with non-Markovian rewards (NMRDPs) \citep{bacchusRewardingBehaviors1996,thiebauxDecisionTheoreticPlanningNonMarkovian2006}. Existing work on NMRDPs \citep{bacchusRewardingBehaviors1996,thiebauxDecisionTheoreticPlanningNonMarkovian2006,brafmanLTLfLDLfNonMarkovian2018} uses temporal logic to describe rewards over histories. OAMDPs, on the other hand, employ the belief of the observer to capture the non-Markovian nature of rewards.

Recent years have seen a surge of interest in the human tendency to ascribe intentionality to autonomous agents \citep{thellmanFolkPsychologicalInterpretationHuman2017,perez-osorioAdoptingIntentionalStance2020}.
In other words, humans often interpret the behaviors of 
autonomous agents as rational behaviors driven by intentions, beliefs, and desires.
While people do not necessarily understand the internal mechanisms of the agents, people can still predict the behaviors of the agents by ascribing intentionality to them.
OAMDPs rely on this tendency to take an intentional stance toward autonomous agents.

\section{Conclusion}
In this paper, we propose the first approximation algorithms for solving OAMDPs/SSPs, Grid-VI and Grid-(L)RTDP.
Both of the algorithms are
 based on discretizing the observer's beliefs into regular grids.
To justify the proposed algorithms, we show that the domain state and the belief of the observer constitute a sufficient statistic for OAMDPs (Proposition~\ref{prop::sufficient_statistics}).
Furthermore, we show that both algorithms converge to the unique value (Propositions~\ref{prop::grid_vi_convergence_mdp}, \ref{prop::grid_vi_convergence_ssp}, and \ref{prop::rtdp_convergence}) and provide performance guarantees under the standard assumptions (Propositions~\ref{prop::grid_vi_error_bound} and \ref{prop::rtdp_convergence}).
Our experimental results show that the proposed algorithms can compute near-optimal policies for OAMDPs/SSPs.
In particular, Grid-(L)RTDP can converge to a solution faster than Grid-VI and has anytime performance competitive with UCT.

\section{Acknowledgements}
This research was supported in part by the NSF grant IIS-2205153 and by the Alliance Innovation Lab Silicon Valley.

%\clearpage % [olivier] not in UAI template
%\bibliographystyle{named}
\bibliography{main}

\appendix
\section{Proofs}
\label{sec:proofs}

To prove Proposition~\ref{prop::lipschitz}, we first prove the Lipschitz continuity of the $n$-step value functions.
Let $V^{(0)}(s, b) = 0$ and 
$V^{(n+1)}(s, b) = \max_{a} \big[ R(s, a, b) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b^{s, a, s'}) \big]$.
Then we have:
\begin{lemma}
For an $(L_r, L_p)$-Lipschitz OAMDP,
$V^{(n)}$ is $L_{V^{(n)}}$-Lipschitz continuous, where $L_{V^{(n)}}$ satisfies:
\begin{equation}
L_{V^{(n+1)}} = L_r + \gamma L_p L_{V^{(n)}}
\end{equation}
\end{lemma}

\begin{proof}
Proof by induction on $n$. For the base case with $n=1$, 
\begin{align}
&|V^{(1)}(s, b_1) - V^{(1)}(s, b_2)| \\
&= |\max_a R(s, a, b_1) - \max_a R(s, a, b_2)| \\
&\leq \max_a |R(s, a, b_1) - R(s, a, b_2)| \\
&\leq L_r \|b_1 - b_2\|_{\infty}
\end{align}

For the induction step,
\begin{align}
&|V^{(n+1)}(s, b_1) - V^{(n+1)}(s, b_2)| \\
&= \Big|\max_a \Big[R(s, a, b_1) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_1^{s, a, s'})\Big]\\
&\quad - \max_a \Big[R(s, a, b_2) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_2^{s, a, s'})\Big]\Big|\\
&\leq \max_a \Big|R(s, a, b_1) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_1^{s, a, s'})\\
&\quad - R(s, a, b_2) - \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_2^{s, a, s'})\Big|\\
&\leq \max_a \Big[|R(s, a, b_1) - R(s, a, b_2)| \\
&\quad +\gamma \sum_{s'} T(s, a, s') |V^{(n)}(s', b_1^{s, a, s'}) - V^{(n)}(s', b_2^{s, a, s'})|\Big]\\
&\leq (L_r + \gamma L_p L_{V^{(n)}}) \|b_1 - b_2\|_{\infty},
\end{align}
where the last inequality uses the $L_r$-Lipschitz continuity of the reward, the induction hypothesis, the $L_p$-Lipschitz continuity of the belief update, and $\sum_{s'} T(s, a, s') = 1$.
\end{proof}

\proplipschitz*

\begin{proof}
Consider a sequence $\{L_n\}_{n \geq 1}$ where $L_1=L_r$ and:
\begin{equation}
L_{n+1} = L_r + \gamma L_p L_n
\end{equation}
Then,
\begin{align}
L_{n} &= L_r + \gamma L_p L_r + (\gamma L_p)^2 L_r + \cdots + (\gamma L_p)^{n-1} L_r \\
&= \frac{1 - (\gamma L_p)^n }{1-\gamma L_p} L_r
\end{align}
By our assumption, $\gamma L_p < 1$, so the sequence converges.
Let $L_{V^{*}} = \lim_{n \rightarrow \infty} L_n$.
$L_{V^{*}}$ must satisfy $L_{V^{*}} = L_r + \gamma L_p L_{V^{*}}$. Thus, we get Equation~\ref{eq::L_V}.
\end{proof}
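
As a purely illustrative instance with hypothetical values (chosen for exposition, not taken from any experiment), let $L_r = 1$, $L_p = 1$, and $\gamma = 0.9$, so that $\gamma L_p = 0.9 < 1$.
Solving the fixed-point relation $L_{V^{*}} = L_r + \gamma L_p L_{V^{*}}$ for $L_{V^{*}}$ gives
\begin{equation*}
L_{V^{*}} = \frac{L_r}{1 - \gamma L_p} = \frac{1}{1 - 0.9} = 10,
\end{equation*}
while the closed form for $L_n$ gives $L_{10} = (1 - 0.9^{10}) \cdot 10 \approx 6.51$, which approaches this limit from below as $n$ grows.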

%\begin{proposition}
%In an OAMDP with
%$\tau^{s, a, s'}(\theta) > 0 $ for $\forall \theta \in \Theta$, $s, s' \in S$, and $a \in A$, 
%belief transitions are Lipschitz continuous.
%\end{proposition}
\proptau*

\begin{proof}
Let $f^{s, a, s'}(b) = b^{s, a, s'} : \Delta^{\Theta} \rightarrow \Delta^{\Theta}$ be the belief transition after observing $\langle s, a, s' \rangle$. 
From the definition (Equation~\ref{eq::belief_update}),
$f^{s, a, s'}(b)(\theta_i) = \frac{\tau^{s, a, s'}_i b_i}{\sum_k \tau^{s, a, s'}_k b_k}$, where $\tau^{s,a,s'}_i = \tau^{s,a,s'}(\theta_i)$ and $b_i = b(\theta_i)$.
Then we have:
\begin{align}
J_{f^{s, a, s'}}(b)_{i, j} &= \begin{cases}
    \frac{\tau^{s, a, s'}_i (\sum_{k \neq i} \tau^{s, a, s'}_k b_k)}{(\sum_{k} \tau^{s, a, s'}_k b_k)^2} & i = j,\\
\frac{-\tau^{s, a, s'}_i \tau^{s, a, s'}_j b_j}{(\sum_{k} \tau^{s, a, s'}_k b_k)^2} & i \neq j,
\end{cases}\\
\|J_{f^{s, a, s'}}(b)\|_{\infty} &= \max_{1 \leq i \leq n} \sum_{1 \leq j \leq n} |J_{f^{s, a, s'}}(b)_{i,j}|\\
&= \max_{1 \leq i \leq n} \frac{2 \tau^{s, a, s'}_i(\sum_{k \neq i} \tau^{s, a, s'}_k b_k) }{(\sum_k \tau^{s, a, s'}_k b_k)^2},
\end{align}
where $J_f$ is the Jacobian of $f$ and $\|\cdot\|_{\infty}$ is the induced operator norm.
%Since the OAMDP is well-behaved,
%we have $\tau_k > 0$ for all $k=1,\cdots,n$.
Let 
$\tau_{\min} = \min_{s,a,s',k} \tau_k^{s,a,s'}$ 
and
$\tau_{\max} = \max_{s,a,s',k} \tau_k^{s,a,s'}$.
Note that,
for every $b \in \Delta^n$, 
$\sum_{k\neq i} \tau^{s, a, s'}_k b_k \leq \tau_{\max}$ and 
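$\sum_{k} \tau^{s, a, s'}_k b_k \geq \tau_{\min}> 0$. Then we get $\|J_{f^{s, a, s'}}(b)\|_{\infty} \leq 2 (\frac{\tau_{\max}}{\tau_{\min}})^2$.
Since $\Delta^n$ is convex, the mean value inequality implies that $f^{s, a, s'}$ is Lipschitz continuous with a constant of at most $2 (\frac{\tau_{\max}}{\tau_{\min}})^2$.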
\end{proof}

\lemmaonestep*
%\begin{lemma}
%For an OAMDP with $L_{V^{*}}$-Lipschitz continuous value function, we have the following bound on one-step approximation errors using a regular grid with resolution $K$:
%\begin{equation}
%\|\mathcal{T}_K V^{*}- V^{*}\|_{\infty} \leq 
%\frac{L_{V^{*}}}{K}
%\end{equation}
%\end{lemma}

\begin{proof}
For all $K \geq 1$, $s \in S$, and $b \in \Delta^{|\Theta|}$,
\begin{align}
&| V^{*}(s,b) - \mathcal{T}_K V^{*}(s, b)| \\
%&=  | V^{*}(s,b) - (\mathcal{A}_K \circ \mathcal{T})(V^{*})(s, b_i)| \text{ (by definition) }\\
&=  | V^{*}(s,b) - \sum_{b_i \in P_K(b)} \lambda_i \mathcal{T}V^{*}(s, b_i)| \text{ (by definition) }\\
&=  | \sum_{b_i \in P_K(b)} \lambda_i (V^{*}(s,b) - V^{*}(s, b_i))| \text{ ($V^{*} = \mathcal{T}V^{*}$ and $\sum_i \lambda_i = 1$) } \\
&\leq  \sum_{b_i \in P_K(b)} \lambda_i |V^{*}(s,b) -  V^{*}(s, b_i)| \text{ (triangle inequality) } \\
&\leq  \sum_{b_i \in P_K(b)} \lambda_i L_{V^{*}} \|b - b_i\|_{\infty} \\
&\leq L_{V^{*}} \frac{1}{K}
\end{align}
\end{proof}

\properrorbound*
%\begin{theorem}
%Let $\epsilon_K=(L_r + \gamma L_p L_{V^{*}})\frac{1}{K}$ be the one-step approximation error with resolution $K$, we have:
%\begin{equation}
%\|V^{*} - V_K^{*}\|_{\infty} \leq 
%\frac{\epsilon_K}{(1-\gamma)} 
%\end{equation}
%\end{theorem}

\begin{proof}
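\begin{align}
&\|V^{*} - V_K^{*}\|_{\infty} \\
&= \|V^{*} - \mathcal{T}_K V^{*} + \mathcal{T}_K V^{*} - V_K^{*}\|_{\infty} \\
&\leq \|V^{*} - \mathcal{T}_K V^{*}\|_{\infty} + \|\mathcal{T}_K V^{*} - \mathcal{T}_K V_K^{*}\|_{\infty} \\
&\leq \frac{L_{V^{*}}}{K} + \gamma \|V^{*} - V_K^{*}\|_{\infty},
\end{align}
where the third line uses $V_K^{*} = \mathcal{T}_K V_K^{*}$ and the last line uses the one-step bound above together with the $\gamma$-contraction property of $\mathcal{T}_K$.
Rearranging yields $\|V^{*} - V_K^{*}\|_{\infty} \leq \frac{L_{V^{*}}}{K(1-\gamma)}$.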
\end{proof}
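
To illustrate how the bound scales with the grid resolution, consider the hypothetical values $L_{V^{*}} = 10$ and $\gamma = 0.9$ (assumed for exposition only).
The bound $\frac{L_{V^{*}}}{K(1-\gamma)}$ obtained from the inequality above then equals $\frac{100}{K}$, so an error of at most $\delta = 1$ is guaranteed whenever
\begin{equation*}
K \geq \frac{L_{V^{*}}}{(1-\gamma)\,\delta} = \frac{10}{0.1 \cdot 1} = 100.
\end{equation*}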

%\begin{proof}
%\cite{munosPerformanceBoundsL_p2007}
%\begin{align}
%&||V^{*} - V^{\pi_{n, K}}||_{\infty} \\
%&= ||\mathcal{T}V^{*} -  \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty}\\
%&= ||\mathcal{T}V^{*} - \mathcal{T}^{\pi_{n, K}}V_{n,K} 
% +\mathcal{T}^{\pi_{n, K}}V_{n,K} - \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty} \\
%&= ||\mathcal{T}V^{*} - \mathcal{T}V_{n,K} 
% +\mathcal{T}V_{n,K} - \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty} \\
%&\leq ||\mathcal{T}V^{*} - \mathcal{T}V_{n,K}||_{\infty} 
% + ||\mathcal{T}^{\pi_{n, K}}V_{n,K} - \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty} \\
%&\leq \gamma ||V^{*} - V_{n,K}||_{\infty} 
% + \gamma ||V_{n,K} - V^{\pi_{n, K}}||_{\infty}\\
%&\leq \gamma ||V^{*} - V_{n,K}||_{\infty} 
% + \gamma ||V_{n,K} - V^{*}||_{\infty} + ||V^{*} - V^{\pi_{n, K}}||_{\infty}
%\end{align}
%
%Thus, we get:
%\begin{equation}
%||V^{*} - V^{\pi_{n, K}}||_{\infty} \leq \frac{2 \gamma }{1-\gamma}||V^{*} - V_{n, K}||_{\infty}
%\end{equation}
%Moreover,
%\begin{align}
%&||V^{*} - V_{n+1, K}||_{\infty} \\
%&\leq ||\mathcal{T}V^{*} - \mathcal{T}V_{n, K}||_{\infty}
%+  ||\mathcal{T}V_{n, K} - V_{n+1, K}||_{\infty} \\
%&\leq \gamma ||V^{*} - V_{n, K}|| + \epsilon_K
%\end{align}
%By taking the upper limit, we get:
%\begin{equation}
%\limsup_{n \rightarrow \infty }||V^{*} - V_{n, K}|| \leq \epsilon_K / (1 - \gamma)
%\end{equation}
%\end{proof}

\section{Pseudocode for Grid-LRTDP}
\label{sec::lrtdp_pseudocode}
Algorithm~\ref{algorithm::lrtdp} shows the pseudocode for Grid-LRTDP.
The algorithm operates identically to Grid-RTDP, except that at the end of each trial, the algorithm checks if states visited during the trial can be labeled as solved.
\begin{algorithm}
    \caption{Grid-LRTDP}
    \begin{algorithmic}[1]
    \Function{Grid-LRTDP}{$s_0$, $b_0$, $\epsilon$, $K$}
        \While{$\exists b_i \in P_K(b_0) : \neg \langle s_0, b_i\rangle.solved$}
            \State \Call{LRTDPTRIAL}{$s_0$, $b_0$, $\epsilon$, $K$}
        \EndWhile
    \EndFunction
    \\
    \Function{LRTDPTRIAL}{$s_0$, $b_0$, $\epsilon$, $K$}
        \State $visited \gets Stack::new()$
        \State $s \gets s_0$
        \State $b \gets b_0$
        \While{episode continues}
            \State \textcolor{red}{sample $b_i \in P_K(b)$ with the weight $\lambda_i$} 
            \State $visited.push(\langle s, b_i \rangle)$ 
            \State $a^{*} \leftarrow  \arg\min_a Q_K(s, b_i, a)$ 
            \State $V_K(s, b_i) \gets Q_K(s, b_i, a^{*})$ 
            \State $s' \sim \Pr(\cdot|s, a^{*})$ 
            \State $b \gets b_i^{s, a^{*}, s'}$ 
            \State $s \gets s'$
        \EndWhile
        \\
        \While{$\neg visited.is\_empty()$}
            \State $\langle s, b \rangle \leftarrow visited.pop() $
            \If{$\neg$ CHECKSOLVED($s$, $b$, $\epsilon$, $K$)}
                \State \textbf{break}
            \EndIf
        \EndWhile
    \EndFunction
    \end{algorithmic}
    \label{algorithm::lrtdp}
\end{algorithm}

Algorithm~\ref{algorithm::check_solved} shows the procedure for labeling states.
Starting from a given $\langle s, b \rangle$, the algorithm visits the states that could be visited under the current best policy and checks whether the residuals of their Bellman updates are smaller than a given threshold $\epsilon$.
\begin{algorithm}
    \caption{CHECKSOLVED}
    \label{algorithm::check_solved}
    \begin{algorithmic}[1]
    \Function{CHECKSOLVED}{$s$, $b$, $\epsilon$, $K$}
        \State $rv \gets true$
        \State $open \gets Stack::new()$
        \State $closed \gets Stack::new()$
        \If{$\neg \langle s, b \rangle.solved$}
            \State $open.push(\langle s, b \rangle)$
        \EndIf
        \While{$\neg open.is\_empty()$}
            \State $\langle s, b \rangle \leftarrow open.pop()$
            \State $closed.push(\langle s, b \rangle)$
            \State $a^{*} \leftarrow  \arg\min_a Q_K(s, b, a)$ 
            \State $\epsilon_{res} \leftarrow |V_K(s, b) - Q_K(s, b, a^{*})|$
            \State $V_K(s, b) \gets Q_K(s, b, a^{*})$ \label{line::value_update}
            \If{$\epsilon_{res} > \epsilon$}
                \State $rv \gets false$
                \State \textbf{continue}
            \EndIf
            \ForAll{ $s' \in S$ such that $T(s, a^{*}, s') > 0$}
                \ForAll{ $b_i \in P_K(b^{s, a^{*}, s'})$ such that $\lambda_i > 0$}
                    \If{$\neg \langle s', b_i\rangle.solved \wedge \langle s', b_i\rangle \notin open \wedge \langle s', b_i\rangle \notin closed $}
                        \State $open.push(\langle s', b_i \rangle)$
                    \EndIf
                \EndFor
            \EndFor
        \EndWhile
        \\
        \If{$rv = true$}
            \ForAll{$\langle s, b \rangle \in closed$}
                \State $\langle s, b \rangle.solved \leftarrow true$
            \EndFor
        \Else
            \While{$\neg closed.is\_empty()$}
                \State $\langle s, b \rangle \leftarrow closed.pop()$
                \State $a^{*} \leftarrow  \arg\min_a Q_K(s, b, a)$ 
                \State $V_K(s, b) \gets Q_K(s, b, a^{*})$
            \EndWhile
        \EndIf
        \State \Return $rv$
    \EndFunction
    \end{algorithmic}
\end{algorithm}

\end{document}