\documentclass{uai2024}

%\documentclass[accepted]{uai2024} % after acceptance, for a revised version;                  
% also before submission to see how the non-anonymous paper would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced                      
%       automatically for papers to be published. Do not make any other                        
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}                                           % \usepackage[british]{babel} 

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}

%% Language setting
%% Replace `english' with e.g. `spanish' to change the document language
%\usepackage[english]{babel}
%%\usepackage{ijcai24}

% Set page size and margins
% Replace `letterpaper' with `a4paper' for UK/EU standard size
%\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}

% Useful packages
\usepackage{amsmath}
\usepackage{graphicx}
%\usepackage[colorlinks=true, allcolors=blue]{hyperref}
\usepackage{amsthm}
\usepackage{amsfonts}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{multirow}
\usepackage{xcolor}
\usepackage{thm-restate}

\theoremstyle{example}
\newtheorem*{example}{Example}
\newtheorem{theorem}{Theorem}
\theoremstyle{definition}
\newtheorem*{definition}{Definition}
\theoremstyle{lemma}
\newtheorem{lemma}{Lemma}
\theoremstyle{proposition}
\newtheorem{proposition}{Proposition}
\theoremstyle{probably}
\newtheorem{probably}{Probably True}

\theoremstyle{corollary}
\newtheorem{corollary}{Corollary}
%\theoremstyle{open}
%\newtheorem{open}{Open Problem}
%\theoremstyle{idea}
%\newtheorem*{idea}{Idea}
%\theoremstyle{question}
%\newtheorem{question}{Question}
%\newenvironment{Proof}{%
%  \renewcommand{\proofname}{Proof}\proof}{\endproof}
% 
%% if needed . . .
%% \renewcommand{\maketitlehooka}{\vbox to 1.75in\bgroup}% was 2.375in

\newenvironment{sketch}{%
  \renewcommand{\proofname}{Proof Sketch}\proof}{\endproof}

\newcommand{\mypara}[1]{\vspace{0pt}\noindent\textbf{#1}~~~}
\newcommand{\ignore}[1]{}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}

\title{Approximation Algorithms for Observer Aware MDPs}

\author{}

\begin{document}
\maketitle

\begin{abstract}
We present approximation algorithms for Observer-Aware Markov Decision Processes (OAMDPs). OAMDPs model sequential decision-making problems in which rewards depend on the beliefs of an observer about the goals, intentions, or capabilities of the observed agent. The first proposed algorithm is grid-based value iteration (Grid-VI), which discretizes the observer's belief space into a regular grid. Based on the same discretization, the second proposed algorithm is a variant of Real-Time Dynamic Programming (RTDP) called Grid-RTDP. Unlike Grid-VI, Grid-RTDP focuses its updates on promising states using heuristic estimates.
We provide theoretical guarantees for the proposed algorithms and demonstrate that Grid-RTDP achieves anytime performance comparable to that of the existing approach, which lacks performance guarantees.
\end{abstract}

\section{Introduction}

Effective communication of intentions, goals, and desires is crucial in our daily interactions and is equally vital for autonomous agents. For instance, consider an autonomous vehicle (AV) approaching a crosswalk with a pedestrian nearby. While the AV might optimize for travel time by approaching the crosswalk at high speed before stopping, this can be unsettling for the pedestrian. A more reassuring approach would be for the AV to slow down well before reaching the crosswalk, signaling its intention to stop. We term such actions that take into account the perspective or beliefs of an observing agent as \emph{observer-aware} behaviors. 
Observer-aware behaviors include making the agent's goal clear \citep{draganGeneratingLegibleMotion2013}, demonstrating its capabilities \citep{kwonExpressingRobotIncapability2018}, and disguising possible intentions \citep{mastersDeceptivePathPlanning2017}.

The Observer-Aware Markov Decision Process (OAMDP) \citep{miuraUnifyingFrameworkObserveraware2021} offers a general framework for producing observer-aware behaviors.
The OAMDP framework assumes a model of how the agent's actions would be interpreted by the observer. 
In OAMDPs, 
possible goals, intentions, or capabilities of the observed agent are represented as types.
After the observed agent takes an action, the observing agent updates its belief over the possible types, which determines the reward function.
%In OAMDPs, rewards can depend on the observer's beliefs about the agent's types.

While the OAMDP framework allows modeling various observer-aware planning problems in a unified way, solving OAMDPs has been shown to be intractable in the worst case \citep{miuraUnifyingFrameworkObserveraware2021}.
The intractability stems from the fact that rewards depend on the belief of the observer, which in turn depends on the history so far.
Previous work proposed using Monte-Carlo Tree Search (MCTS) to solve OAMDPs for the finite-horizon objective~\citep{miuraUnifyingFrameworkObserveraware2021}.
While MCTS exhibits good anytime behavior,
it does not provide guarantees on the quality of the resulting policies.

In this paper, we propose the first approximation algorithms for OAMDPs.
We begin by establishing that the domain state and the observer's belief are sufficient for optimal control in OAMDPs (Proposition~\ref{prop::sufficient_statistics}). 
%Then we discuss assumptions needed for applying approximate algorithms.
%Since the value function in OAMDPs can be discontinuous over the belief of the observer, 
%we identify a subclass of OAMDPs where the value function remains continuous over the observer's belief.
Our first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the belief of the observer into regular grids.
We show that Grid-VI converges to the unique fixpoint in both the discounted (Proposition~\ref{prop::grid_vi_convergence_mdp}) and undiscounted (Proposition~\ref{prop::grid_vi_convergence_ssp}) settings under the standard assumptions, and provide error bounds for the discounted setting (Proposition~\ref{prop::grid_vi_error_bound}).
%At each iteration of the algorithm, the values at each grid points are updated, where
%values at other beliefs are linearly interpolated using grid points.
A potential drawback of Grid-VI is that it can waste time updating values at irrelevant states. 
To address the issue, we propose a variant of Real-Time Dynamic Programming (RTDP)~\citep{bartoLearningActUsing1995a} to solve OAMDPs, called Grid-RTDP.
Grid-RTDP utilizes heuristic estimates to focus updates on promising states.
We show that Grid-RTDP retains RTDP's desirable convergence property (Proposition~\ref{prop::rtdp_convergence}). Our experimental results indicate that the proposed algorithms are capable of computing near-optimal policies. Specifically, Grid-RTDP solves problems significantly faster than Grid-VI and offers anytime performance comparable to MCTS.

\section{Background and Notation}
\subsection{Markov Decision Processes}
A Markov decision process (MDP) models sequential decision-making under uncertainty. An MDP is described by a tuple 
$M = \langle S, A, T, R, \gamma, d_0 \rangle$. $S$ is a set of states. $A$ is a set of actions. $T(s_t, a_t, s_{t+1})$ is the probability of $S_{t+1}{=}s_{t+1}$ when $A_t{=} a_t$ and $S_t{=}s_{t}$. $R(s_t, a_t)$ is the reward for taking $a_t$ at $s_t$.
$\gamma$ is a parameter called the discount factor.
$d_0$ is the initial state distribution, i.e., $s_0 \sim d_0$.
%The absorbing terminal state always transitions back to itself with zero reward.

A solution of an MDP is called a \emph{policy} ($\pi$).
%We use the following two types of policies in the paper. A \emph{stationary policy} is a conditional distribution of actions given a state.
%A \emph{history-dependent policy} is a conditional distribution of actions given a history, where a history $h_{t+1}$ is a sequence of state-action pairs up to time $t$ and the last visited state $s_{t+1}$.
An optimal policy for an MDP is a policy that maximizes $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)|d_0, \pi]$.
A policy $\pi$ induces a value function $V^{\pi}(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t)|S_0=s,\pi]$.
The optimal value function $V^{*}$ is a value function corresponding to an optimal policy.
%For a particular state, a value function $V^{\pi}_{H}$ represents the expected return given a policy $\pi$ up to time step $H$. 
%When $H$ is finite, we call it a value function for a finite horizon.

\subsection{Stochastic Shortest Path Problems}
A \emph{stochastic shortest path problem} (SSP) is an undiscounted, cost-based counterpart of an MDP. An SSP is represented by
 a tuple $\langle S, A, T, C, d_0, G \rangle $ where:
$S$, $A$, $T$ are the same as in an MDP. $C: S \times A \rightarrow \mathbb{R}_{+}$ gives the cost $C(s_t, a_t)$ of performing $a_t$ at $s_t$.
$d_0$ is the initial state distribution. $G \subset S$ is a set of goal states.
The goal states are absorbing, and actions taken in goal states incur zero cost.
%In this paper, we only consider finite sets of states and actions.

A solution of an SSP is a \emph{policy}. 
%A \emph{deterministic policy} $\pi$ maps a state $s$ to an action $a \in A$.
%A \emph{stochastic policy} $\pi$ maps a state $s$ to a probability distribution on $A$.
%A policy $\pi$ induces a value function $V^{\pi}(s) = \mathbb{E}[\sum_{t=0}^{\infty} C(S_t, A_t)| d_0, \pi]$, which represents the expected cost of reaching a goal state from $s$ by following $\pi$.
An \emph{optimal policy} $\pi^{*}$ is a policy that minimizes 
$\mathbb{E}[\sum_{t=0}^{\infty} C(S_t, A_t)| d_0, \pi]$.
We restrict our attention to problems in which there exists at least one \emph{proper policy}, which reaches the goal from all states with probability $1$.
Under this assumption, an SSP is guaranteed to have an optimal policy that is proper \citep{bertsekasAnalysisStochasticShortest1991}.


\begin{figure*}[t]
    \centering
        \begin{subfigure}[b]{0.40\linewidth}
            \centering
            \includegraphics[height=1.6in]{imgs/mcts_baker5_403.png}
            \caption{Environment}
            \label{img::baker5_403}
        \end{subfigure}%
        \begin{subfigure}[b]{0.40\linewidth}
            \centering
            \includegraphics[height=1.6in]{imgs/mcts_belief_changes_baker_403.png}
            \caption{Observer's belief ($\beta=0.3$)}
            \label{img::belief_changes_baker5_403}
        \end{subfigure}
        \vspace{-8pt}
        \caption{MazeWorld Domain}
        \label{img::mazeworld}
\end{figure*}

\subsection{Observer-Aware MDPs}
Observer-Aware Markov Decision Processes (OAMDPs) extend MDPs by allowing the reward to depend on the observer's assumed belief over the types of the observed agent \citep{miuraUnifyingFrameworkObserveraware2021}.

\begin{definition}
An OAMDP is a tuple\footnote{The original work~\citep{miuraUnifyingFrameworkObserveraware2021} allowed an arbitrary function from $H^{*}$ to $\Delta^{|\Theta|}$ to update the observer's belief. Here, we restrict our attention to a case where the observer updates its belief in a Bayesian fashion.}\\ 
\centerline{~~~$M = \langle S, A, T, \gamma, d_0, \allowbreak \Theta, b_0, \tau, R \rangle$ where:}
\begin{itemize}
    \item $S$, $A$, $T$, $\gamma$, and $d_0$ are the same as in MDPs.
        In this paper, we assume $S$ and $A$ are finite.
	\item  $\Theta$ is a (finite) set of \emph{types}, representing a  characteristic of the agent such as possible goals, intentions, or capabilities.
        \item $b_0 \in \Delta^{|\Theta|}$ is the initial belief of the observer over the types, where
        $\Delta^{|\Theta|}$ is the probability simplex over $\Theta$.
%	\item $B: H^{*} \rightarrow \Delta^{|\Theta|}$ represents the assumed belief of the observer given a history.
%	$H^{*}$ is the set of all finite histories and $\Delta^{|\Theta|}$ is a simplex on $\Theta$.
        \item $\tau: S \times A \times S \times \Theta \rightarrow [0, 1]$	is the probability of the observer witnessing a transition $\langle s, a, s'\rangle$ given $s$ and $\theta$, written $\tau(a, s'|s, \theta)$.
        $\tau$ can represent different policies and transition functions of the observed agent depending on types.
	\item $R : S \times A \times \Delta^{|\Theta|} \rightarrow \mathbb{R}$ is a belief-dependent reward function. 
    In this paper, we assume that the rewards can be represented as a linear combination of \emph{domain} and \emph{belief-dependent} rewards.
    That is, $R(s, a, b) {=} w_d R_d(s, a) + w_b R_b(b)$ for $w_d, w_b {\in} \mathbb{R}_{+}$, where $R_d$ and $R_b$ represent the domain and belief-dependent rewards, respectively. 
%	Note that the reward depends on histories through the beliefs.
% describes how desirable it is to take an action given a state and a belief $b \in \Delta^{|\Theta|}$.  
%	When the reward depends only on $\Delta^{|\Theta|}$, we abuse the notation slightly and treat $R$ as
%	$\Delta^{|\Theta|} \rightarrow \mathbb{R}$.
\end{itemize}

After observing a transition $\langle s, a, s' \rangle$,  the observer is assumed to update its belief ($b_t$) using Bayes' rule:
\begin{equation}
b_{t+1}^{s,a,s'}(\theta) = \frac{ \tau(a, s'|s, \theta) \cdot b_t(\theta) }{ \sum_{\theta' \in \Theta} \tau(a, s'|s, \theta')  \cdot b_t(\theta') }.
\label{eq::belief_update}
\end{equation}

A solution to an OAMDP is a policy that maximizes the expected discounted return:
\begin{equation}
\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(S_t,A_t,B_t)|d_0, \pi]. 
\end{equation}
%For an OAMDP $M = \langle S, A, T, \gamma, d_0, \allowbreak \Theta, b_0, \tau, R, \rangle$, we define the corresponding \emph{domain MDP} as
%$M_d = \langle S, A, T, \gamma, d_0, R_d \rangle$.
\end{definition}

Figure~\ref{img::mazeworld} shows an example of an OAMDP with $\Theta= \{\theta_A, \theta_B, \theta_C, \theta_D, \theta_E \}$, where each type corresponds to a possible goal of the observed agent.
$\tau(a, s'|s, \theta)$ is typically set to  
$T_{\theta}(s, a, s') \pi_{\theta}(s, a)$, where 
$\pi_{\theta}$ is an assumed policy of the observed agent given a type $\theta$ and
$T_{\theta}$ is a transition function given a type $\theta$.
For example, $\pi_{\theta_A}$ represents a policy given the observed agent is going to the goal $A$.
When $\Theta$ represents different capabilities of the observed agent,
$T_{\theta}$ represents transition functions corresponding to different capabilities.
When $T_{\theta}$ is the same for all $\theta \in \Theta$, 
the transition term cancels in Equation~\ref{eq::belief_update}, so $\tau(a, s'|s, \theta)$ can be replaced by $\pi_{\theta}(s, a)$.
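
To see why the transition term drops out, the following short derivation substitutes $\tau(a, s'|s, \theta) = T(s, a, s')\, \pi_{\theta}(s, a)$ into Equation~\ref{eq::belief_update}. It is a sketch that assumes a shared transition function $T$ with $T(s, a, s') > 0$ for the observed transition:
\begin{align*}
b_{t+1}^{s,a,s'}(\theta)
&= \frac{T(s, a, s')\, \pi_{\theta}(s, a)\, b_t(\theta)}{\sum_{\theta' \in \Theta} T(s, a, s')\, \pi_{\theta'}(s, a)\, b_t(\theta')} \\
&= \frac{\pi_{\theta}(s, a)\, b_t(\theta)}{\sum_{\theta' \in \Theta} \pi_{\theta'}(s, a)\, b_t(\theta')}.
\end{align*}
As a purely illustrative instance with hypothetical numbers, take two types with $b_t = (0.5, 0.5)$, $\pi_{\theta_A}(s, a) = 0.8$, and $\pi_{\theta_B}(s, a) = 0.2$. Then $b_{t+1}^{s,a,s'}(\theta_A) = 0.4 / (0.4 + 0.1) = 0.8$ and $b_{t+1}^{s,a,s'}(\theta_B) = 0.2$.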


\mypara{Noisy Rational Model}
A common approach in modeling the observer involves using inverse planning. This assumes that the observed agent behaves approximately rationally given its type.
\citet{bakerActionUnderstandingInverse2009} explored the connection between Bayesian reasoning and human understanding of goals. The model presented in their work presumes noisy rationality:
% 
\begin{equation}
\pi_{\theta}(s, a) \propto \exp\big(\beta Q_{\theta}^{*}(s, a)\big),
\label{eq::noisy_rational}
\end{equation}
where $Q_{\theta}^{*}$ is the optimal Q-value
%$Q^{*}(s,a|\theta) = \mathbf{E}[\sum_{t=0}^{\infty} \gamma^t R_t|S_0{=}s,A_0{=}a,\pi^{*}, \theta]$ 
representing how good $a$ is given $s$ and $\theta$. 
Note that $Q_{\theta}^{*}$ is computed with respect to $T_{\theta}$ and $R_{\theta}$ (the reward function corresponding to $\theta$), and
$\beta \in \mathbb{R}$ is a hyperparameter representing the agent's rationality level. Intuitively, the observed agent is assumed to select an action with probability proportional to the exponential of the action's quality at the current state.
Figure~\ref{img::belief_changes_baker5_403} shows how the observer's belief changes when the observer's model follows Equation~\ref{eq::noisy_rational}.
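
\begin{example}
For concreteness, Equation~\ref{eq::noisy_rational} can be written in normalized form as
\begin{equation*}
\pi_{\theta}(s, a) = \frac{\exp\big(\beta Q_{\theta}^{*}(s, a)\big)}{\sum_{a' \in A} \exp\big(\beta Q_{\theta}^{*}(s, a')\big)}.
\end{equation*}
The numbers that follow are hypothetical and serve only as an illustration. Suppose there are two actions with $Q_{\theta}^{*}(s, a_1) = 1$ and $Q_{\theta}^{*}(s, a_2) = 0$, and let $\beta = 0.3$. Then $\pi_{\theta}(s, a_1) = e^{0.3} / (e^{0.3} + 1) \approx 0.574$ and $\pi_{\theta}(s, a_2) \approx 0.426$. Setting $\beta = 0$ yields the uniform policy, while $\beta \rightarrow \infty$ concentrates all probability on the optimal actions.
\end{example}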

\mypara{Belief-Dependent Rewards}
OAMDPs can produce various observer-aware behaviors by changing $R_b$.
For instance, to clarify intentions, $R_b$ might be defined as the negative total variation (TV) distance or the negative Euclidean distance between the current and target beliefs, where the target belief is $b(\theta^{*})=1$ for the intended type $\theta^{*} \in \Theta$.
%For example, if the goal is to make the intention clear \citep{draganGeneratingLegibleMotion2013},
%$R_b$ could be the negative total variation (TV) or Euclidean distance between the current belief and the target
%belief ($b(\theta^{*})=1$ for the intended type $\theta^{*} \in \Theta$).
On the other hand, if the observed agent wants to obscure its intention, rewards could be the entropy of the observer's belief.
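
One possible formalization of these choices, stated here as an assumption rather than as the exact definitions used in the experiments, is the following. Let $b^{*}$ denote the target belief with $b^{*}(\theta^{*}) = 1$, and use the convention $0 \log 0 = 0$. Then
\begin{align*}
R_b^{\mathrm{TV}}(b) &= -\tfrac{1}{2} \sum_{\theta \in \Theta} |b(\theta) - b^{*}(\theta)| = b(\theta^{*}) - 1, \\
R_b^{\mathrm{Euc}}(b) &= -\|b - b^{*}\|_2, \\
R_b^{\mathrm{Ent}}(b) &= -\sum_{\theta \in \Theta} b(\theta) \log b(\theta).
\end{align*}
The superscripts are labels introduced only for this illustration. The closed form in the first line holds because $b^{*}$ places all of its mass on $\theta^{*}$, so both $|b(\theta^{*}) - 1|$ and $\sum_{\theta \neq \theta^{*}} b(\theta)$ equal $1 - b(\theta^{*})$.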

\section{Properties of OAMDPs}
In this section, we discuss the properties of OAMDPs needed to develop the proposed algorithms.
\subsection{Sufficient Statistics for Optimal Control}
To compute policies for OAMDPs,
previous work \citep{miuraUnifyingFrameworkObserveraware2021} used a general-purpose method 
%such as AO$^{*}$ \cite{nilsonnilsPrinciplesArtificialIntelligence1980} 
such as UCT \citep{kocsisBanditBasedMonteCarlo2006} to
compute history-dependent policies.
%that do not exploit the structure of OAMDPs.
However,  
we show that
the current state and the belief of the observer contain sufficient information to choose the best action to take:
\begin{proposition}
The current state and the belief of the observer are sufficient for optimal control for OAMDPs.
\label{prop::sufficient_statistics}
\end{proposition}
\begin{proof}
For all $s_t,s_{t+1} \in S$, $a_t \in A$, $b_t \in \Delta^{|\Theta|}$, $h_t \in H_t$:
\begin{align}
    &\Pr(s_{t+1}, b_{t+1}|s_t, a_t, b_t, h_t) \\ 
    &= \Pr(b_{t+1}|s_t, a_t, s_{t+1}, b_t, h_t)\Pr(s_{t+1}|s_t, a_t, b_t, h_t) \\ 
    &= [b_{t+1} = b_t^{s_t, a_t, s_{t+1}}] T(s_t, a_t, s_{t+1}) \text{ by definition}\\
    &= \Pr(s_{t+1}, b_{t+1}|s_t, a_t, b_t) 
\end{align}
where $[\cdot]$ is the Iverson bracket. Moreover, $R$ only depends on $S_t$, $A_t$, and $B_t$ by definition.
\end{proof}

With Proposition~\ref{prop::sufficient_statistics} in place, we can look for policies of the form $\pi: S \times \Delta^{|\Theta|} \times A \rightarrow [0, 1]$.
In other words, we can look for policies for the corresponding \emph{belief MDP}, whose state space is $S \times \Delta^{|\Theta|}$ instead of $S$. 
%$M_b = \langle S \times \Delta^{|\Theta|}, A, T_b, R, \gamma, d_0^b \rangle$ where:
%\begin{align}
%    T_b(\langle s, b \rangle, a, \langle s', b' \rangle)
%    &= [b'=b^{s,a,s'}]T(s, a, s'),\\
%    d_0^b(\langle s, b \rangle)
%    &=[b=b_0]d_0(s).
%\end{align}
%\begin{itemize}
%    \item \[T_b(\langle s, b \rangle, a, \langle s', b' \rangle) = \begin{cases}
%        0 & b' \neq b^{s, a, s'} \\
%        T(s, a, s') & b' = b^{s, a, s'} 
%    \end{cases}\]
%    
%    \item \[d_0^b(\langle s, b \rangle) = \begin{cases}
%        0 & b \neq b_0 \\
%        d_0(s) & b = b_0
%    \end{cases}\]
%\end{itemize}
%corresponding to the original OAMDP, where the set of states is $S \times \Delta^{|\Theta|}$ instead of $S$.
Note that while the original OAMDP has a finite number of states, the belief MDP has a continuous state space.
Proposition~\ref{prop::sufficient_statistics}
 is
analogous to how beliefs over states (belief states) are sufficient for optimal control for POMDPs~\citep{kaelblingPlanningActingPartially1998}.
However, while
most solution methods for POMDPs~\citep{monahanSurveyPartiallyObservable1982,pineauPointbasedValueIteration2003a} rely on 
the piecewise linearity and convexity (PWLC) of the value function,
the value functions of OAMDPs are not necessarily PWLC.
For example, if $R_b$ is the negative Euclidean distance from the target belief of the intended type, then
$R_b$ is not PWLC on $\Delta^{|\Theta|}$.
Therefore, solution methods for POMDPs are not directly applicable to OAMDPs.
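
Concretely, the belief MDP mentioned above can be sketched as $M_b = \langle S \times \Delta^{|\Theta|}, A, T_b, R, \gamma, d_0^b \rangle$, where the symbols $M_b$, $T_b$, and $d_0^b$ are notation introduced here for illustration:
\begin{align*}
T_b(\langle s, b \rangle, a, \langle s', b' \rangle) &= [b' = b^{s,a,s'}]\, T(s, a, s'), \\
d_0^b(\langle s, b \rangle) &= [b = b_0]\, d_0(s),
\end{align*}
with $[\cdot]$ the Iverson bracket. The belief component evolves deterministically once the transition $\langle s, a, s' \rangle$ is fixed, so all stochasticity in $M_b$ comes from the domain transition function $T$.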

\subsection{Discontinuity in Value Functions}
Before delving into our proposed algorithms, we address a potential issue in developing an approximation algorithm for OAMDPs.
Both of our proposed algorithms approximate values by grouping similar beliefs.
This approach operates under the implicit assumption that nearby beliefs should yield similar values.
However, we demonstrate that, in a general OAMDP, the rate at which the observer's belief changes can be unbounded, thus invalidating this assumption. To illustrate this issue, consider the following example:
%But then, $f: b \mapsto b^{s,a,s'}$ is no standard Lipschitz function.

\begin{example}
Let us assume that we have an OAMDP with:
\begin{itemize}
\item $\Theta=\{\theta_0, \theta_1, \theta_2\}$,
\item $b_1=(1-\epsilon,\epsilon, 0) \in \Delta^3$,
\item $b_2=(1-\epsilon, 0, \epsilon) \in \Delta^3$, and
\item $\tau^{s,a,s'} = (\tau_0=0, \tau_1 >0, \tau_2>0)$.
\end{itemize}
Then, $b_1^{s,a,s'} = (0,1,0)$ and  $b_2^{s,a,s'}  = (0,0,1)$. Thus,
\begin{align}
\frac{\|b_1^{s, a, s'}-b_2^{s, a, s'}\|_{\infty}}{\|b_1-b_2\|_{\infty}}
& = \frac{\|(0,1,-1)\|_{\infty}}{\|(0,\epsilon,-\epsilon)\|_{\infty}} = \frac{1}{\epsilon}.
\end{align}
This ratio diverges as $\epsilon \rightarrow 0$, so no finite constant bounds how far apart two nearby beliefs can be after a single update.
\end{example}


\subsection{Lipschitz OAMDPs}
Given the potential discontinuity in values,
we discuss special cases of OAMDPs with Lipschitz-continuous reward and belief transitions.
\begin{definition}
An OAMDP is $(L_r, L_p)$-Lipschitz if for all $s, s' \in S$, $a \in A$, and $b_1, b_2 \in \Delta^{|\Theta|}$:
\begin{align}
|R(s,a,b_1) - R(s, a, b_2)| \leq L_r \|b_1 - b_2\|_{\infty}, \\
\|b_1^{s, a, s'} - b_2^{s, a, s'}\|_{\infty}
\leq L_p\|b_1 - b_2\|_{\infty}.
\end{align}
\end{definition}
Intuitively, in Lipschitz OAMDPs, beliefs close to each other have similar rewards and update to close beliefs.
The definition is analogous to Lipschitz continuity of continuous MDPs in general \citep{rachelsonLocalityActionDomination2010}.

Lipschitz continuity of reward and belief transitions can be related to
Lipschitz continuity of the value function under a favorable assumption:
\begin{restatable}{proposition}{proplipschitz}
\label{prop::lipschitz}
For an $(L_r, L_p)$-Lipschitz OAMDP,
if $\gamma L_p < 1$,
then $V^{*}$ is $L_{V^{*}}$-Lipschitz continuous where: \begin{equation}
\label{eq::L_V}
L_{V^{*}} = \frac{L_r}{1 - \gamma L_p}.
\end{equation}
\end{restatable}

\begin{proof}
See Appendix~\ref{sec:proofs}.
\end{proof}
As we will see later, Lipschitz continuity enables us to provide an error bound for the discretization (Proposition~\ref{prop::grid_vi_error_bound}).
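
The form of Equation~\ref{eq::L_V} can be motivated by the following informal sketch, which is not necessarily the argument used in the appendix. Suppose the $n$-th value-iteration iterate $V_n$ is $L_n$-Lipschitz in $b$ for every $s$, starting from $V_0 = 0$ so that $L_0 = 0$. Using $|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|$ and the two conditions of an $(L_r, L_p)$-Lipschitz OAMDP,
\begin{align*}
&|V_{n+1}(s, b_1) - V_{n+1}(s, b_2)| \\
&\leq \max_{a \in A} \Big[ |R(s,a,b_1) - R(s,a,b_2)| \\
&\qquad + \gamma \sum_{s' \in S} T(s,a,s')\, \big|V_n(s', b_1^{s,a,s'}) - V_n(s', b_2^{s,a,s'})\big| \Big] \\
&\leq (L_r + \gamma L_p L_n)\, \|b_1 - b_2\|_{\infty}.
\end{align*}
Hence $L_{n+1} = L_r + \gamma L_p L_n$, a recursion whose limit is $L_r / (1 - \gamma L_p)$ exactly when $\gamma L_p < 1$. With the hypothetical values $\gamma = 0.9$, $L_p = 1.05$, and $L_r = 1$, we get $\gamma L_p = 0.945$ and $L_{V^{*}} = 1 / 0.055 \approx 18.2$.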

%\begin{example}
%Consider the following example:
%\begin{itemize}
%\item a problem with 2 types,
%\item $b=(b_1,b_2)$
%\item $\tau_1 >0$ and $\tau_2>0$.
%\end{itemize}
%
%\begin{tiny}
%\begin{align}
% f(b) & = \frac{\tau \odot b}{\norm{\tau \odot b}_1} \\
% & = \frac{(\tau_1 \cdot b_1, \tau_2 \cdot b_2)}{\tau_1 \cdot b_1 + \tau_2 \cdot b_2} \\
% \frac{\partial f}{\partial b_1} (b)
%  & = \left(
%  \frac{\tau_1 }{\tau_1 \cdot b_1 + \tau_2 \cdot b_2}
%  -\frac{(\tau_1 \cdot b_1) \cdot \tau_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  ,
%  -\frac{(\tau_2 \cdot b_2) \cdot \tau_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  \right) \\
%  & = \left(
%  \frac{\tau_1 \cdot (\tau_1 \cdot b_1 + \tau_2 \cdot b_2) - (\tau_1 \cdot b_1) \cdot \tau_1 }{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  ,
%  -\frac{(\tau_1 \tau_2) \cdot b_2}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  \right) \\
%  & = \left(
%  \frac{+(\tau_1 \tau_2) \cdot b_2}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  ,
%  \frac{-(\tau_1 \tau_2) \cdot b_2}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  \right) \\
% \frac{\partial f}{\partial b_2} (b)
%  & = \left(
%  \frac{-(\tau_1 \tau_2) \cdot b_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  ,
%  \frac{+(\tau_1 \tau_2) \cdot b_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}
%  \right) \\
%    ||J_f(b)||_1 &= 
%  \frac{(\tau_1 \tau_2) \cdot b_2}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2} +  \frac{(\tau_1 \tau_2) \cdot b_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2} \\
%  &= \frac{(\tau_1 \tau_2) \cdot (b_1 + b_2)}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2)^2}\\
%  &= \frac{\tau_1 \tau_2}{((\tau_1 - \tau_2)b_1 + \tau_2)^2} \\
%  \max_{b \in \Delta^2} ||J_f(b)||_1 &= \begin{cases}
%   1 & \tau_1 = \tau_2 \\
%   \frac{\tau_2}{\tau_1} & \tau_1 - \tau_2 > 0 \\
%   \frac{\tau_1}{\tau_2} & \tau_1 - \tau_2 < 0 \\
%  \end{cases}
%\end{align}
%\end{tiny}
%Since $||J_f(b)||_1$ is bounded for all $b \in \Delta^2$, $f$ is Liphshitz continuous on $\Delta^2$ \footnote{I need to find the right reference for the result}.
%\end{example}


%\begin{example}
%Similarly, when we consider a problem with 3 types:
%\begin{itemize}
%\item $b=(b_1,b_2,b_3)$
%\item $\tau_1 >0$ , $\tau_2>0$, and $\tau_3>0$.
%\end{itemize}
%
%\begin{tiny}
%\begin{align}
% f(b) &= \frac{\tau \odot b}{\norm{\tau \odot b}_1} \\
% &= \frac{(\tau_1 \cdot b_1, \tau_2 \cdot b_2, \tau_3 \cdot b_3)}{\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3} \\
% \frac{\partial f}{\partial b_1} (b)
%  &= \left(
%  \frac{\tau_1 (\tau_2 b_2 + \tau_3 b_3)}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2},
%  \frac{-\tau_1 \tau_2 b_2}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2},
%  \frac{-\tau_1 \tau_3 b_3}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2}
%  \right) \\
% \frac{\partial f}{\partial b_2} (b)
%  &= \left(
%  \frac{-\tau_2 \tau_1 b_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2},
%  \frac{\tau_2 (\tau_1 b_1 + \tau_3 b_3)}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2},
%  \frac{-\tau_2 \tau_3 b_3}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2}
%  \right) \\
% \frac{\partial f}{\partial b_3} (b)
%  &= \left(
%  \frac{-\tau_3 \tau_1 b_1}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2},
%  \frac{-\tau_3 \tau_2 b_2}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2},
%  \frac{\tau_3 (\tau_1 b_1 + \tau_3 b_3)}{(\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3)^2}
%  \right) \\
%  ||J_f(b)||_1 &= \max_{1 \leq j \leq 3} \sum_{1 \leq i \leq 3} |J_f(b)_{i,j}|
%\end{align}
%\end{tiny}
%Note that for every $b \in \Delta^3$,
%$\tau_1 \cdot b_1 + \tau_2 \cdot b_2 + \tau_3 \cdot b_3 > 0$.
%So, $||J_f(b)||_1$ is bounded for every $b \in \Delta^3$ if $\tau_1, \tau_2, \tau_3 > 0$.
%Thus, $f$ is Lipschitz continuous on $\Delta^3$.
%\end{example}
%

Moreover, in OAMDPs, belief transitions are assumed to be the Bayesian update using Equation~\ref{eq::belief_update}.
We can establish a relationship between the Lipschitz continuity of belief transitions and $\tau$ as follows:
\begin{restatable}{proposition}{proptau}
\label{prop::tau}
If
$\tau^{s, a, s'}(\theta) > 0 $ for all $\theta \in \Theta$, $s, s' \in S$, and $a \in A$, 
then belief transitions are Lipschitz continuous.
\end{restatable}
\begin{proof}
See Appendix~\ref{sec:proofs}.
\end{proof}
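For example, under the noisy rational model (Equation~\ref{eq::noisy_rational}), every action has positive probability, so $\tau^{s, a, s'}(\theta) > 0$ whenever the type-specific transition probability $T_{\theta}(s, a, s')$ is positive; this guarantees the Lipschitz continuity of belief transitions.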
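
\begin{example}
The following two-type calculation is an illustrative sketch of Proposition~\ref{prop::tau}, not a substitute for the proof. Write $b = (b_1, 1 - b_1)$ and $\tau^{s,a,s'} = (\tau_1, \tau_2)$ with $\tau_1, \tau_2 > 0$. The first component of the updated belief is
\begin{equation*}
f(b_1) = \frac{\tau_1 b_1}{\tau_1 b_1 + \tau_2 (1 - b_1)},
\qquad
f'(b_1) = \frac{\tau_1 \tau_2}{\big(\tau_1 b_1 + \tau_2 (1 - b_1)\big)^2}.
\end{equation*}
On $[0, 1]$ the denominator is at least $\min(\tau_1, \tau_2)^2$, so $|f'(b_1)| \leq \max(\tau_1, \tau_2) / \min(\tau_1, \tau_2)$, which is a valid Lipschitz constant for this transition. The bound grows without limit as either $\tau_1$ or $\tau_2$ approaches zero, consistent with the unbounded ratio that arises when some type assigns zero probability to the observed transition.
\end{example}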

\subsection{OASSPs}
We define an undiscounted, cost-based version of OAMDPs called OASSPs.
An OASSP is a tuple $\langle S, A, T, d_0, \Theta, b_0, \tau, C, G\rangle$ where $C: S \times A \times \Delta^{|\Theta|} \rightarrow \mathbb{R}_{+}$ is a belief-dependent cost function, and $G$ is a set of goal states.
The other components are the same as in OAMDPs.
An optimal policy for an OASSP is a policy that minimizes $\mathbb{E}[\sum_{t=0}^{\infty} C(S_t, A_t, B_t)|d_0, \pi]$.
As in OAMDPs, we assume that $C$ is a linear combination of the domain cost ($C_d$) and the belief-dependent cost ($C_b$). That is, $C(s, a, b) = w_d C_d(s, a) + w_b C_b(b)$.
The domain SSP corresponding to an OASSP is the SSP defined as $M_d=\langle S, A, T, d_0, C_d, G\rangle$. 

\section{Approximation Algorithms}
In this section, we propose approximation algorithms for OAMDPs and OASSPs.
Our first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the observer's belief into regular grids. Our second proposed algorithm is a variant of Real-Time Dynamic Programming (RTDP), called Grid-RTDP.
Grid-RTDP relies on the same grid-based discretization scheme as Grid-VI, but focuses its updates on promising states using heuristic estimates. 

\subsection{Grid-Based Value Iteration for OAMDP/SSPs}
% \item The value function is not necessarily linear (unlike POMDP).

We first describe a grid-based value iteration algorithm for OAMDPs and OASSPs.
Grid-VI uses a set of regular grid points to approximate value functions.
A regular grid with resolution $K$ is defined as:
%Let $K$ be a positive integer representing the resolution of the grid. The regular grid is defined as:
\begin{equation}
P_K = \Big\{ b = \tfrac{1}{K} k \;\Big|\; k \in I^{|\Theta|}_+, \sum^{|\Theta|}_{i=1} k(i) = K\Big\},
\end{equation}
where $I^{|\Theta|}_+$ is the set of $|\Theta|$-vectors of non-negative integers.
$P_K$ divides $\Delta^{|\Theta|}$ into a set of equal-size sub-simplices.
Figure~\ref{img::triangulation} shows an example of a regular grid on $\Delta^3$ with $K=2$.
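
\begin{example}
As a small worked instance of the definition, for $|\Theta| = 3$ and $K = 2$ the non-negative integer vectors $k$ whose entries sum to $2$ yield six grid points:
\begin{align*}
P_2 = \big\{ & (1,0,0),\ (0,1,0),\ (0,0,1), \\
& (\tfrac{1}{2},\tfrac{1}{2},0),\ (\tfrac{1}{2},0,\tfrac{1}{2}),\ (0,\tfrac{1}{2},\tfrac{1}{2}) \big\}.
\end{align*}
The three unit vectors correspond to certainty about a single type, and the remaining three points are the midpoints of the edges of the simplex. The order in which the points are listed here is arbitrary and need not match any labeling used in the figures.
\end{example}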

As in \citet{lovejoyComputationallyFeasibleBounds1991},
the value at a given belief point $b \in \Delta^{|\Theta|}$ is interpolated
using the barycentric coordinates of $b$ with respect to $P_K(b)$:
%a convex combination of values at $P_K(b)$:
\begin{equation}
    V_{K}(s, b) = \sum_{b_i \in P_K(b)} \lambda_i V_K(s, b_i),
    \label{eq::convex_interpolation}
\end{equation}
where $P_K(b)$ is the set of corners of the sub-simplex containing $b$, $\lambda_i \geq 0$, $\sum_{i=1}^{|\Theta|} \lambda_i = 1$, and $b=\sum_{i=1}^{|\Theta|} \lambda_i b_i$.
In Figure~\ref{img::triangulation}, the value at $b$ is interpolated using the values at $b_4$, $b_5$, and $b_6$.
For each iteration,
the algorithm updates values at all $s \in S$ and $b \in P_K$ 
using the Bellman optimality operator ($\mathcal{T}$):
\begin{small}
\begin{equation}
\label{eq::bellman}
(\mathcal{T}V_K)(s, b) = \max_{a \in A} \big[R(s, a, b) + \gamma \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, a, s'})\big],
\end{equation}
\end{small}
where values at $b \not \in P_K$ are interpolated using Equation~\ref{eq::convex_interpolation}.
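
\begin{example}
To make the interpolation in Equation~\ref{eq::convex_interpolation} concrete, consider a hypothetical instance with $|\Theta| = 3$ and $K = 2$. Take $b = (0.6, 0.3, 0.1)$ and assume that it lies in the sub-simplex with corners $(1, 0, 0)$, $(\tfrac{1}{2}, \tfrac{1}{2}, 0)$, and $(\tfrac{1}{2}, 0, \tfrac{1}{2})$. Solving $b = \sum_i \lambda_i b_i$ gives $\lambda = (0.2, 0.6, 0.2)$, since
\begin{equation*}
0.2\,(1,0,0) + 0.6\,(\tfrac{1}{2},\tfrac{1}{2},0) + 0.2\,(\tfrac{1}{2},0,\tfrac{1}{2}) = (0.6, 0.3, 0.1).
\end{equation*}
If the values stored at the three corners were $10$, $6$, and $4$ (made-up numbers), the interpolated value would be $V_K(s, b) = 0.2 \cdot 10 + 0.6 \cdot 6 + 0.2 \cdot 4 = 6.4$.
\end{example}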
%The resulting policy is obtained by one-step lookahead using values at given belief points: 
%\begin{equation}
%\pi_{K}(s, b) = \argmax_{a \in A} \big[R(s, a, b) + \gamma \sum_{s' \in S} T(s,a,s') V_{K}(s', b^{s, a, s'})\big].
%\end{equation}
%\end{small}
The resulting policy is obtained as: 
\begin{equation}
\pi_K(s, b, a) = 
\sum_{b_i \in P_K(b)} \lambda_i [a = \argmax_{a_i \in A} Q_K(s, b_i, a_i)],
\end{equation}
where $Q_K(s, b_i, a_i) = R(s, a_i, b_i) + \gamma \sum_{s' \in S} T(s,a_i,s') V_{K}(s', b_i^{s, a_i, s'})$.
That is, we take the optimal action at each corner of the sub-simplex with probability given by the corresponding weight $\lambda_i$.
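
The procedure described above can be summarized by the pseudocode in Algorithm~\ref{alg::grid_vi_sketch}. It is only a sketch: the zero initialization, the synchronous update through a temporary table $V'_K$, and the stopping threshold $\epsilon$ are assumptions made for illustration and are not prescribed by the description above.
\begin{algorithm}[t]
\caption{Grid-VI (illustrative sketch)}
\label{alg::grid_vi_sketch}
\begin{algorithmic}[1]
\Require OAMDP $M$, resolution $K$, threshold $\epsilon > 0$ (assumed)
\State $V_K(s, b) \gets 0$ for all $s \in S$, $b \in P_K$
\Repeat
    \State $\delta \gets 0$
    \For{each $s \in S$ and $b \in P_K$}
        \State $V'_K(s, b) \gets (\mathcal{T} V_K)(s, b)$ using Equation~\ref{eq::bellman}
        \State $\delta \gets \max\big(\delta, |V'_K(s, b) - V_K(s, b)|\big)$
    \EndFor
    \State $V_K \gets V'_K$
\Until{$\delta < \epsilon$}
\State \Return $V_K$
\end{algorithmic}
\end{algorithm}
Each backup evaluates $V_K$ at the updated beliefs $b^{s,a,s'}$, which generally fall off the grid and are therefore interpolated with Equation~\ref{eq::convex_interpolation}.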

For problems with undiscounted objectives (OASSPs), 
the maximization over rewards in Equation~\ref{eq::bellman} is replaced with a minimization over costs, and the discount factor is dropped.

\subsubsection*{Efficient Interpolation}
\begin{figure}[t!]
        \centering
        \includegraphics[width=\linewidth]{imgs/triangulation.pdf}
        \caption{An example of discretized belief points $P_K$ (right) with $K=2$ and $|\Theta|=3$. The left panel shows the corresponding integer points ($P_K'$).}
        \label{img::triangulation}
\end{figure}
One key advantage of using a regular grid is that the weights $\lambda$ can be computed efficiently.
To find the barycentric coordinates of $b\in \Delta^{|\Theta|}$ with respect to $P_K(b) \subset \Delta^{|\Theta|}$, we use a Freudenthal triangulation \citep{freudenthalSimplizialzerlegungenBeschrankterFlachheit1942}:
\begin{equation}
P_K' = \Big\{ q \in I^{|\Theta|}_+ \;\Big|\; K=q_1 \geq q_2 \geq \cdots \geq q_{|\Theta|} \Big\}.
\end{equation}
Note that $|P_K'|=|P_K|=\frac{(K + |\Theta| - 1)!}{K! (|\Theta|-1)!}$.
Due to the one-to-one correspondence between points in $P_K$ and $P_K'$, we can find the barycentric coordinates of 
$b\in \Delta^{|\Theta|}$ using the barycentric coordinates of the corresponding $v \in I^{|\Theta|}_+$ \citep{lovejoyComputationallyFeasibleBounds1991}.
As discussed by \cite{zhouImprovedGridbasedApproximation2001}, finding a sub-simplex can be done in $\mathcal{O}(|\Theta| \log |\Theta|)$ time.
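\begin{example}
As a sanity check of the counting formula in the setting of Figure~\ref{img::triangulation} ($K=2$, $|\Theta|=3$):
\begin{equation*}
|P_K| = \frac{(2+3-1)!}{2!\,(3-1)!} = \frac{4!}{2!\,2!} = 6,
\end{equation*}
and indeed $P_K' = \{(2,0,0), (2,1,0), (2,1,1), (2,2,0), (2,2,1), (2,2,2)\}$ has six elements.
Assuming the usual correspondence $q_i = K \sum_{j \geq i} b_j$ (our reading of the construction in \citet{lovejoyComputationallyFeasibleBounds1991}; it is not spelled out above), the grid point $b = (\tfrac{1}{2}, \tfrac{1}{2}, 0)$ maps to $q = (2, 1, 0)$ and the vertex $b = (0, 0, 1)$ maps to $q = (2, 2, 2)$.
\end{example}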

\subsubsection*{Theoretical Guarantees}
We now discuss theoretical guarantees of Grid-VI.
\begin{proposition}
\label{prop::grid_vi_convergence_mdp}
For an OAMDP,
Grid-VI converges to the unique fixpoint $V_K^{*}$.
\end{proposition}
\begin{proof}
The interpolation (Equation~\ref{eq::convex_interpolation}) can be understood as an operator on the value function.
Let $\mathcal{A}_K$ be the corresponding operator. Then Grid-VI can be seen as repeatedly applying $\mathcal{T}_{K} = \mathcal{A}_K \circ \mathcal{T}$
to the value function.
Since $\mathcal{A}_K$ is a non-expansion and $\mathcal{T}$ is a contraction in the max norm, 
$\mathcal{A}_K \circ \mathcal{T}$ is also a contraction, and Grid-VI converges to the unique fixpoint $V^{*}_K$ \citep{gordonStableFunctionApproximation1995}.
\end{proof}

%We know bound the error be
\begin{restatable}{lemma}{lemmaonestep}
\label{lemma::one_step}
For an OAMDP whose value function is Lipschitz continuous with constant $L_{V^{*}}$, the one-step approximation error using a regular grid with resolution $K$ is bounded as:
\begin{equation}
\|\mathcal{T}_K V^{*}- V^{*}\|_{\infty} \leq 
\frac{L_{V^{*}}}{K}.
\end{equation}
\end{restatable}

\begin{restatable}{proposition}{properrorbound}
\label{prop::grid_vi_error_bound}
For an OAMDP whose value function is $L_{V^{*}}$-Lipschitz continuous,
 we have:
\begin{equation}
\|V^{*} - V_K^{*}\|_{\infty} \leq 
\frac{L_{V^{*}}}{(1-\gamma)K} .
\end{equation}
\end{restatable}

\begin{proof}
See Section~\ref{sec:proofs}.
\end{proof}
Note that the right-hand sides go to $0$ as $K \rightarrow \infty$.
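\begin{example}
The bound can be inverted to choose a resolution. As a sketch with hypothetical constants: to guarantee $\|V^{*} - V_K^{*}\|_{\infty} \leq \epsilon$, Proposition~\ref{prop::grid_vi_error_bound} shows that it suffices to take
\begin{equation*}
K \geq \frac{L_{V^{*}}}{(1-\gamma)\,\epsilon}.
\end{equation*}
For instance, with the illustrative values $L_{V^{*}} = 1$, $\gamma = 0.9$, and $\epsilon = 0.5$ (none of which come from the experiments), this gives $K \geq 20$.
\end{example}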

Next, we discuss a case where Grid-VI is applied to undiscounted problems (OASSPs). 
We first note that,
for an OASSP $M=\langle S, A, T, d_0, \Theta, b_0, \tau, C, G\rangle$,
Grid-VI for OASSPs implicitly defines an SSP $M_K = \langle S \times P_K, A, T_K, d_0^K, C_K, G_K \rangle$ where
% no blank line here
\begin{align}
    T_K(\langle s, b \rangle, a, \langle s', b_i \rangle) &= \begin{cases} 
        0 & b_i \notin P_K(b^{s, a, s'}),  \\
        \lambda_i T(s, a, s') & b^{s, a, s'} = \sum_i \lambda_i b_i,
    \end{cases} \\
    d_0^K(\langle s, b_i\rangle) &= 
    \begin{cases}
        0 & b_i \notin P_K(b_0), \\
        \lambda_i d_0(s) & b_0 = \sum_i \lambda_i b_i,
    \end{cases}\\
    C_K(s, a, b) &= C(s, a, b),\\
    G_K &= G \times P_K.
\end{align}

The states in $M_K$ consist only of the corners of sub-simplices.
The transitions in $M_K$ are the same as in the original OASSP, except that, after the belief update, the updated belief $b^{s, a, s'}$ is mapped to one of the surrounding belief points $b_i \in P_K(b^{s, a, s'})$ with probability $\lambda_i$.
Note that, unlike the original OASSP, the number of belief states in $M_K$ is finite.
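
As a quick consistency check (a short derivation added here for illustration), $T_K$ is a valid transition function. For every $\langle s, b \rangle$ and $a$, writing $\lambda_i$ for the barycentric weights of $b^{s, a, s'}$, we have
\begin{equation*}
\sum_{s' \in S} \sum_{b_i \in P_K(b^{s, a, s'})} \lambda_i T(s, a, s') = \sum_{s' \in S} T(s, a, s') = 1,
\end{equation*}
since the weights $\lambda_i$ sum to one for each $s'$.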


Since $M$, $M_d$, and $M_K$ all have the same dynamics in terms of domain state transitions, we have:
\begin{lemma}
    If $M_d$ has a proper policy, $M$ and $M_K$ also have at least one proper policy.
    \label{lemma::proper_policy}
\end{lemma}
\begin{proof}
Let $\pi_d$ be a proper policy for $M_d$.
Then $\pi(\langle s, b\rangle, a) = \pi_K(\langle s, b\rangle, a) = \pi_d(s, a)$ are proper policies for $M$ and $M_K$, respectively.
\end{proof}

For an SSP with a finite number of states, 
value iteration converges to the unique fixpoint as long as there is a proper policy~\citep{bertsekasAnalysisStochasticShortest1991}. Thus, we get:
\begin{proposition}
\label{prop::grid_vi_convergence_ssp}
If $M_d$ has a proper policy,
Grid-VI for OASSPs converges to the unique fixpoint $V^{*}_K$.
\end{proposition}


Our algorithm shares similarities with grid-based approximations for POMDPs \citep{lovejoyComputationallyFeasibleBounds1991,brafmanHeuristicVariableGrid1997,hauskrechtValuefunctionApproximationsPartially2000,zhouImprovedGridbasedApproximation2001,bonetEpsilonOptimalGridBasedAlgorithm2002}. 
The main difference is that the belief is over $\Theta$ in OAMDP/SSPs instead of over $S$ as in POMDPs.
Approximation using regular grids requires a number of grid points that is exponential in the dimension of the belief vectors.
%This exponential requirement has been a major hurdle in applying grid-based approximation algorithms to real-world POMDPs, where the number of underlying states can be very large.
However, in most scenarios, it is reasonable to assume that the number of possible intentions
($|\Theta|$) is much smaller than the number of states.
Thus, needing a number of grid points exponential in the dimension of the belief vectors is less of a constraint for OAMDP/SSPs.
%Secondly, grid points are needed for each
%$s \in S$ in OAMDPs, as the beliefs do not contain information about the domain states. 
%This necessitates a minor modification to the existing algorithms. 
%This requires a slight modification to the existing algorithms. Note that we can alternatively use $(|S| + |\Theta|)$-dimension belief vectors where the beliefs on $S$ is either $1$ or $0$.

Our Grid-VI for OAMDP/SSPs can be seen as a special case of grid-based value iteration for 
continuous-state MDPs \citep{chowOptimalOnewayMultigrid1991,munosVariableResolutionDiscretization2002}.
One main difference is that, in OAMDP/SSPs, the continuous part of the state space ($\Delta^{|\Theta|}$) is guaranteed to be a simplex, which enables the efficient interpolation method.
Another difference is that, due to the structure of OAMDP/SSPs, discretization preserves the existence of a proper policy (Lemma~\ref{lemma::proper_policy}).

\subsection{Grid-based Real-Time Dynamic Programming for OAMDP/SSPs}
We now propose an extension of Real-Time Dynamic Programming (RTDP) \citep{bartoLearningActUsing1995a} to OAMDP/SSPs, called Grid-RTDP.
A potential issue with Grid-VI is that it needs to update values at every state and every grid point.
However, many of these points could be irrelevant in computing an optimal policy.
RTDP is an asynchronous value iteration algorithm that can converge to the optimal solution without having to consider the entire state space.
RTDP avoids exploring a portion of the state space by utilizing an admissible heuristic.
Our presentation in this section will be based on OASSPs.

%To adapt RTDP to solve OAMDPs, we use a simple discretization scheme ($d$) over beliefs over types:
%\begin{equation}
%d_K(b) = \left\lceil K \cdot b \right\rceil
%\end{equation}
%where $K$ is a positive integer representing the resolution of discretization, 
%and $\left\lceil \right\rceil$ is a ceiling function.
%For example,
%The discretization scheme is identical to the one used for adapting RTDP to POAMDPs, except that the beliefs are over $S$ in POMDPs but over $\Theta$ in OAMDPs.
%

%\mypara{RTDP for OASSPs}
Grid-RTDP discretizes beliefs into regular grids as in Grid-VI.
The value at a belief $b \in \Delta^{|\Theta|}$ is interpolated using Equation~\ref{eq::convex_interpolation}.
Algorithm~\ref{algorithm::rtdp} shows the pseudocode for Grid-RTDP.
The algorithm consists of repeated trials, where each trial starts from the initial state and belief of the observer.
During each trial, the algorithm first maps the current belief $b$ randomly to one of the surrounding grid points $b_i \in P_K(b)$, where $b=\sum_{i=1}^{|\Theta|} \lambda_i b_i$.
Each $b_i$ is selected with probability $\lambda_i$ (line~\ref{line::discretize}).
Then the algorithm selects an action that minimizes the current cost estimate to the goal $Q_K(s, b_i, a)$ (line~\ref{line::greedy}):
\begin{small}
\begin{align}
&Q_K(s, b, a) \\
&= C(s, a, b) + \sum_{s' \in S} T(s, a, s') V_K(s', b^{s, a, s'})\\
&= C(s, a, b) + \sum_{s' \in S} T(s, a, s') \sum_{b_i \in P_K(b^{s, a, s'})} \lambda_i V_K(s', b_i),
\end{align}
\end{small}
where $V_K$ is initialized with a given heuristic function $h$.
%There could be many possible heuristic functions.
In this paper, we consider the following two heuristic functions:
\begin{itemize}
\item $h_{0}$: which always returns $0$ (in other words, no heuristics), and
\item $h_{d}$: which returns the scaled optimal cost to go for the underlying domain cost ($w_d \cdot V_d^{*}(s)$).
\end{itemize}
Note that both $h_0$ and $h_d$ are admissible heuristics.
After selecting the best action $a^{*}$, the cost estimate for the current state ($V_K(s, b_i)$) is updated to $Q_K(s, b_i, a^{*})$ (line~\ref{line::value_update}). Note that the values are updated only at beliefs in $P_K$.
The next state is then sampled according to the dynamics of the environment (line~\ref{line::next_state}) and the belief of the observer is updated accordingly (line~\ref{line::belief_update}).
%The resulting policy is obtained by one-step lookahead: 
%\begin{equation}
%\pi_K(s, b) = \argmin_{a \in A} Q_K(s, b, a).
%\end{equation}
The resulting policy is obtained as: 
\begin{equation}
\pi_K(s, b, a) = 
\sum_{b_i \in P_K(b)} \lambda_i [a = \argmin_{a_i \in A} Q_K(s, b_i, a_i)].
\end{equation}
That is, the optimal action at each corner $b_i$ of the sub-simplex is taken with probability given by the corresponding weight $\lambda_i$.

\begin{algorithm}[t]
    \caption{Grid-RTDP}
    \begin{algorithmic}[1]
    \Function{Grid-RTDP}{}
        \While{within computational budget}
            \State \Call{Trial}{$d_0$, $b_0$}
        \EndWhile
    \EndFunction
    \\
    \Function{Trial}{$d_0$, $b_0$}
        \State $s \sim d_0$
        \State $b \gets b_0$
        \While{episode continues}
            \State \textcolor{red}{sample $b_i \in P_K(b)$ with the weight $\lambda_i$} \label{line::discretize}
            \State $a^{*} \gets \argmin_a Q_K(s, b_i, a)$ \label{line::greedy}
            \State $V_K(s, b_i) \gets Q_K(s, b_i, a^{*})$ \label{line::value_update}
            \State $s' \sim \Pr(\cdot|s, a^{*})$ \label{line::next_state}
            \State $b \gets b_i^{s, a^{*}, s'}$ \label{line::belief_update}
            \State $s \gets s'$
        \EndWhile
    \EndFunction
    \end{algorithmic}
    \label{algorithm::rtdp}
\end{algorithm}

The algorithm is akin to RTDP-Bel \citep{bonetSolvingPOMDPsRTDPbel2009}, 
a version of RTDP developed for POMDPs.
Similar to Grid-RTDP,
RTDP-Bel is based on discretizing beliefs.
Let $d(b)$ be a discretization of $b$.
Unlike Grid-RTDP, which updates the value at $d(b)$ using Q-values at $d(b)$,
RTDP-Bel updates the value at $d(b)$ using Q-values at $b$.
This can be a problem when two different belief points $b_1$ and $b_2$ discretize to the same point ($d(b_1) = d(b_2)$), resulting in RTDP-Bel's oscillating behavior.
%However, RTDP-Bel does not have a convergence guarantee and may oscillate. This is because if two different belief points $b_1$ and $b_2$ are discretized to the same grid point, to update the value at  

\mypara{Properties}
We discuss some properties of Grid-RTDP.
When applied to SSPs, RTDP has the following guarantee:
%\begin{theorem}[\cite{bertsekasNeuroDynamicProgramming1996}]
%    If there exists a proper policy for an SSP $M$, every trial in RTDP for $M$ terminates in a finite number of steps.
%    \label{th::rtdp_finite_step}
%\end{theorem}

\begin{theorem}[\cite{bartoLearningActUsing1995a}]
    If there exists a proper policy for an SSP and the initial value function is admissible, RTDP converges to the optimal value at relevant states.
    \label{th::rtdp_convergence}
\end{theorem}

We will now show that Grid-RTDP inherits a property analogous to Theorem~\ref{th::rtdp_convergence} under the following conditions:
\begin{itemize}
    \item[A1] The domain SSP $M_d$ has a proper policy.
    \item[A2] The initial value estimates are admissible.
\end{itemize}


Combining Lemma~\ref{lemma::proper_policy} with Theorem~\ref{th::rtdp_convergence}, we get:
%\begin{proposition}
%    Under A1, every trial in RTDP for OAMDP terminates in a finite number of steps.
%    \label{prop::rtdp_finite_step}
%\end{proposition}

\begin{proposition}
    Under A1 and A2, Grid-RTDP converges to the optimal values ($V_K^{*}$) at relevant states.
    \label{prop::rtdp_convergence}
\end{proposition}

\subsubsection{Grid-Based Labeled RTDP for OASSPs}
We now propose an extension of labeled RTDP (LRTDP) \citep{bonetSolvingStochasticShortestPath2002} to OASSPs, called Grid-LRTDP.
The original
RTDP does not explicitly check for convergence and can keep visiting states that are already solved, which slows its convergence.
LRTDP alleviates the issue by labeling those states as solved.
%Similar to RTDP, LRTDP for OAMDP uses a regular grid, and operates by mapping a belief $b$ to its neighbors $P_K(b)$.
The algorithm labels states as solved if residuals of Bellman updates in the states that could be visited under the current best policy are smaller than a given threshold.
Alternatively, Grid-LRTDP can be understood as applying LRTDP to $M_K$.
The pseudocode for the algorithm is given in Appendix~\ref{sec::lrtdp_pseudocode}.

\section{Experiments}
We present experimental results solving OASSPs using the proposed algorithms.

\subsection{Domains}
We briefly describe the problem domains used in the experiments.
\begin{figure}[t!]
        \centering
        \begin{subfigure}[b]{0.3\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/blocks_world.drawio.png}
            \caption{BlocksWorld}
            \label{img::blocks_world}
        \end{subfigure}
        \hspace{10pt}
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/spelling.pdf}
            \caption{Acronym}
            \label{img::acronym}
        \end{subfigure}
        \caption{Example problem domains.}
\end{figure}

\mypara{MazeWorld}
Figure~\ref{img::baker5_403} shows an example of MazeWorld.
%where the agent can take 9 different actions: \emph{Stay, North, South, East, West, NorthEast, NorthWest, SouthEast} and \emph{SouthWest}. 
The agent's goal is to reach one of the possible goals
$\{A,B, C, D, E\}$.
The domain costs are proportional to the distance traveled.
To encourage being clear about the intention, $C_b$ is the total variation (TV) distance from the target belief.
To make the problem more challenging, the agent can get transported to the initial state with probability $0.1$ at each time step.
We set $w_d = 0.1$ and $w_b = 1.0$.

\mypara{BlocksWorld}
Figure~\ref{img::blocks_world} shows an example of BlocksWorld from \cite{miuraUnifyingFrameworkObserveraware2021}, where the goal is to stack blocks to spell ``ARMS''. Picking up a block always succeeds, while
putting down a block fails with probability $0.3$ (the block falls on the table).
Each domain action has a cost of $1$. 
$C_b$ is the TV distance from the target belief.
We set $w_d = 0.1$ and $w_b = 1.0$.
The optimal policy first stacks ``R'' on top of ``S''. This is not optimal in terms of task progression, but tells the observer that the goal ``ARMS'' is more likely than ``RAMS''.

%\mypara{MazeWorld with Explicit Communication}
%Figure shows an example of MazeWorld,
%where the agent can take 9 different actions: \emph{Stay, North, South, East, West, NorthEast, NorthWest, SouthEast} and \emph{SouthWest}. 
%The agent's goal is to reach either one of the possible goals
%$\{A,B, C\}$.
%The rewards for the actions are proportional to the negative distance traveled by the action.
%
\mypara{Acronym}
Figure~\ref{img::acronym} illustrates the Acronym domain.
There are four locations with letters.
The agent can move in eight different directions.
When the agent is at a location with a letter, it can toggle the letter in the cyclic order $A \rightarrow M \rightarrow R \rightarrow S \rightarrow A$.
The potential goals are to spell ``ARMS'', ``RAMS'', or ``MARS'' from top left to bottom right.
When toggling among letters,
there is a $0.3$ probability of accidentally toggling too far.
The objective is to spell ``ARMS'' while being ambiguous about the intention. 
$C_b(b) = H_{max} - H(b)$ where $H_{max}$ is the entropy of the uniform distribution and $H(b)$ represents the entropy of $b$.
$w_d = 0.5$ and $w_b = 1.0$.

\subsection{Offline Convergence}
We compare the following algorithms in terms of the time required for the maximum residual to fall below $\epsilon=10^{-3}$:
\begin{itemize}
\item Grid-VI with $K=1, 4, 16$;
\item Grid-LRTDP with $K=1, 4, 16$ using $h_0$ and $h_d$.
\end{itemize}
Each run has a time limit of $10$ minutes and a memory limit of $2$~GB.

Table~\ref{tab:covergence} shows the results.
Grid-LRTDP using $h_d$ was overall the best algorithm, generating fewer belief states to solve problems.
The exception was the MazeWorld domain, where, due to the random transition back to the initial state, Grid-(L)RTDP had to generate most of the belief points.
%The benefit of using the informative heuristic varied across problems.
While some problems required only coarse discretization of beliefs, other problems required finer discretization to compute near optimal policies.

\begin{table*}[ht]
\begin{small}
\begin{tabular}{ |c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|  }
 \hline
 Domain & $|\Theta|$ & K & \multicolumn{4}{|c|}{Grid-VI} & \multicolumn{4}{|c|}{Grid-LRTDP($h_0$)} & \multicolumn{4}{|c|}{Grid-LRTDP($h_d$)}\\
  &  &  &V & t(s) & $|S|$ & $|P_K|$ & V & t(s) & $|S|$ & $|P_K|$ & V & t(s) & $|S|$ & $|P_K|$ \\
 \hline
 \multirow{3}{*}{MazeWorld}   & \multirow{3}{*}{$5$} &   1 & 19.15 & 5.32 & 148 & 740 & 18.9 & 2.65 & 148 & 740 & 19.05 & 3.28 & 148 & 606 \\
    &   & 4 & 16.69 & 167.41 & 148 & 10360 & 16.60 & 155.22 & 148 & 10157 & 16.67 & 198.60 & 148 & 9419 \\
    &   & 16 & - & - & - &-&-&-&-&-& -&-&-&-\\
 \hline
 \multirow{3}{*}{Acronym}   & \multirow{3}{*}{$3$}  
          & 1 & 15.69 & 13.36 &  6379 & 19137 & 15.71 & 4.62 & 6379 & 19137 & 15.86 & 5.03 &  6379 & 19137 \\
    &  & 4 & 8.41 & 121.28 & 6379 & 95685 & 10.27 & 39.38 &6379 & 89116 &10.49 & 19.04 & 6379 & 40480 \\
    &  & 16 & - & - & - & - & 8.38 & 208.23 & 6379 & 973053 & 8.37 & 10.02 & 6292 & 43476\\
    %fb3
 \hline
 \multirow{3}{*}{BlocksWorld}   & \multirow{3}{*}{$2$}  
          & 1 & 3.57 & 2.2 & 125 & 250 & 3.57 & 2.8 & 125 & 250 & 3.57 & 1.1 & 125 & 134 \\
    &     & 4 & 3.04 & 4.60 & 125 &625 & 3.36 & 3.52 & 125 & 542 & 3.03 & 2.73 & 124 & 387 \\
    &    & 16 & 3.03 & 15.48 & 125 & 2125 & 3.03 & 16.076 & 125 & 1692 & 3.03 & 11.45 & 124 & 1103 \\
 \hline
\end{tabular}
\end{small}
\caption{ Time until convergence for different algorithms. $V$ represents the value when the policy is evaluated under the true environment ($M$). $t(s)$ is the running time in seconds. $|S|$ and $|P_K|$ represent the number of generated domain and belief states, respectively.}
\label{tab:covergence}
\end{table*}

\subsection{Anytime Performance}
We compare the following algorithms in terms of their anytime behavior:
\begin{itemize}
\item Grid-(L)RTDP with $K=4, 8$ using $h_d$;
\item UCT where the rollout policy $\pi^{*}_d$ is an optimal policy for the domain SSP.
\end{itemize}

Each algorithm was run for $10^2$, $10^3$, $5 \cdot 10^3$, $10^4$, $5 \cdot 10^4$, $10^5$, $5 \cdot 10^5$, $10^6$ Grid-(L)RTDP/UCT trials.
For UCT, the specified number of trials is performed online at each time step.
For Grid-(L)RTDP, the trials are performed offline.
Each run has a time limit of $10$ minutes and a memory limit of $2$~GB.
Figure~\ref{img::anytime} shows the results.
UCT and Grid-(L)RTDP exhibited complementary performance. While UCT showed better anytime performance in Acronym, it took some time to achieve good performance in BlocksWorld, a small problem instance with $|\Theta|=2$.
Comparing Grid-(L)RTDP with different resolutions ($K$), coarser grids generally resulted in better anytime behavior as long as the resolution was sufficient.
Grid-RTDP and Grid-LRTDP exhibited comparable anytime behavior.

\begin{figure*}[t!]
        \centering
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/reset5_online_302.png}
            \caption{MazeWorld}
		\label{img::maze_online}
        \end{subfigure}
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/spelling_online_8.png}
            \caption{Acronym}
		\label{img::trace}
        \end{subfigure}
        \begin{subfigure}[b]{0.32\linewidth}
            \centering
            \includegraphics[width=\linewidth]{imgs/blocks_online_1.png}
            \caption{BlocksWorld}
		\label{img::blocks_online}
        \end{subfigure}
        \caption{Anytime behaviors for different algorithms.}
		\label{img::anytime}
\end{figure*}

%\begin{figure}[t!]
%    \centering
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/baker_101.png}
%            \caption{Grid VI ($K=10$)}
%		\label{img::trace}
%        \end{subfigure}
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/mcts_baker_101.png}
%            \caption{UCT with $5000$ iterations}
%		\label{img::trace}
%        \end{subfigure}
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/belief_changes_baker_101.png}
%            \caption{Belief Changes (Grid VI)}
%		\label{img::belief_change_without}
%        \end{subfigure}
%        \begin{subfigure}[b]{0.4\linewidth}
%            \centering
%            \includegraphics[width=\linewidth]{imgs/mcts_belief_changes_baker_101.png}
%            \caption{Belief Changes (UCT)}
%		\label{img::belief_change_without}
%        \end{subfigure}
%	\caption{Example of combining implicit and explicit communication in the Maze World environment ($\beta = 0.3$).}
%	\label{img::grid_world}
%\end{figure}

\section{Related Work}
OAMDP is a framework unifying different kinds of observer-aware behaviors.
\emph{Legible} behavior \citep{draganGeneratingLegibleMotion2013,miuraMaximizingLegibilityStochastic2021} implicitly conveys intentions via the choice of actions.
Similarly, \emph{explicable} behaviors \citep{zhangPlanExplicabilityPredictability2017} conform to observers' expectations.
\emph{Deceptive} behaviors \citep{draganDeceptiveRobotMotion2015,mastersDeceptivePathPlanning2017} hide agents' intentions or actively deceive observers.
\emph{Predictable} behaviors enable observers to predict future actions \citep{fisacGeneratingPlansThat2020}.
Agents can also express their \emph{(in)capability} via the choice of their actions \citep{kwonExpressingRobotIncapability2018}.
%While there have been several attempts to combine different kinds of observer-aware behaviors~\cite{draganGeneratingLegibleMotion2013,strouseLearningShareHide2018,chakrabortiBalancingExplicabilityExplanations2019,kulkarniUnifiedFrameworkPlanning2019}, there is no unifying framework that reveals the relationships among the approaches and the complexity of the problem.
%

OAMDPs can be regarded as a special case of decision processes with non-Markovian rewards (NMRDPs) \citep{bacchusRewardingBehaviors1996,thiebauxDecisionTheoreticPlanningNonMarkovian2006}. Unlike OAMDPs, existing works on NMRDPs \citep{bacchusRewardingBehaviors1996,thiebauxDecisionTheoreticPlanningNonMarkovian2006,brafmanLTLfLDLfNonMarkovian2018} utilize temporal logic to describe rewards over histories. OAMDPs, on the other hand, employ the belief of the observer to capture the non-Markovian nature of rewards.

OAMDPs are related to the line of work that reasons about the belief of other agents.
In particular, OAMDPs can be seen as a restricted subset of Interactive POMDPs \citep{gmytrasiewiczFrameworkSequentialPlanning2005}, where agents act by recursively modeling the other agents' beliefs \citep{miuraUnifyingFrameworkObserveraware2021}.
In game theory,
psychological games deal with utility that depends on the belief of the other agent \citep{battigalliBeliefDependentMotivationsPsychological2022b}.
Epistemic game theory \citep{pereaEpistemicGameTheory2012} also explicitly reasons about the belief of the other agent.

\section{Conclusion}
In this paper, we propose the first approximation algorithms for solving OAMDP/SSPs, Grid-VI and Grid-(L)RTDP.
Both of the algorithms are
 based on discretizing the observer's beliefs into regular grids.
To justify the proposed algorithms, we show that the domain state and the belief of the observer constitute a sufficient statistic for OAMDPs (Proposition~\ref{prop::sufficient_statistics}).
Furthermore, we show that both algorithms converge to the unique value (Propositions~\ref{prop::grid_vi_convergence_mdp}, \ref{prop::grid_vi_convergence_ssp}, and \ref{prop::rtdp_convergence}) and provide performance guarantees under the standard assumptions (Propositions~\ref{prop::grid_vi_error_bound} and \ref{prop::rtdp_convergence}).
Our experimental results show that the proposed algorithms can compute near-optimal policies for OAMDP/SSPs.
In particular, Grid-(L)RTDP can converge to a solution faster than Grid-VI and has anytime performance competitive with UCT.
%\clearpage % [olivier] not in UAI template
%\bibliographystyle{named}
\bibliography{main}

\appendix
\section{Proofs}
\label{sec:proofs}

To prove Proposition~\ref{prop::lipschitz}, we first prove the Lipschitz continuity of the $n$-step value function.
Let $V^{(0)}(s, b) = 0$ and 
$V^{(n+1)}(s, b) = \max_{a} \big[ R(s, a, b) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b^{s, a, s'}) \big]$.
Then we have:
\begin{lemma}
For a $(L_r, L_p)$-Lipschitz OAMDP,
$V^{(n)}$ is $L_{V^{(n)}}$-Lipschitz continuous, where $L_{V^{(n)}}$ satisfies:
\begin{equation}
L_{V^{(n+1)}} = L_r + \gamma L_p L_{V^{(n)}}
\end{equation}
\end{lemma}

\begin{proof}
We proceed by induction on $n$. For the base case $n=1$, 
\begin{align}
&|V^{(1)}(s, b_1) - V^{(1)}(s, b_2)| \\
&= |\max_a R(s, a, b_1) - \max_a R(s, a, b_2)| \\
&\leq \max_a |R(s, a, b_1) - R(s, a, b_2)| \\
&\leq L_r \|b_1 - b_2\|_{\infty}
\end{align}

For the induction step,
\begin{align}
&|V^{(n+1)}(s, b_1) - V^{(n+1)}(s, b_2)| \\
&= \Big|\max_a \big[R(s, a, b_1) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_1^{s, a, s'})\big]\\
&\quad - \max_a \big[R(s, a, b_2) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_2^{s, a, s'})\big]\Big|\\
&\leq \max_a \Big|R(s, a, b_1) + \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_1^{s, a, s'})\\
&\quad - R(s, a, b_2) - \gamma \sum_{s'} T(s, a, s') V^{(n)}(s', b_2^{s, a, s'})\Big|\\
&\leq \max_a \Big[|R(s, a, b_1) - R(s, a, b_2)| \\
&\quad +\gamma \sum_{s'} T(s, a, s') |V^{(n)}(s', b_1^{s, a, s'}) - V^{(n)}(s', b_2^{s, a, s'})|\Big]\\
&\leq (L_r + \gamma L_p L_{V^{(n)}}) \|b_1 - b_2\|_{\infty}.
\end{align}
\end{proof}

\proplipschitz*

\begin{proof}
Consider a sequence $\{L_n\}_{n \geq 1}$ where $L_1=L_r$ and:
\begin{equation}
L_{n+1} = L_r + \gamma L_p L_n
\end{equation}
Then,
\begin{align}
L_{n} &= L_r + \gamma L_p L_r + (\gamma L_p)^2 L_r + \cdots + (\gamma L_p)^{n-1} L_r \\
&= \frac{1 - (\gamma L_p)^n }{1-\gamma L_p} L_r
\end{align}
By our assumption, $\gamma L_p < 1$, so the sequence converges.
Let $L_{V^{*}} = \lim_{n \rightarrow \infty} L_n$.
$L_{V^{*}}$ must satisfy $L_{V^{*}} = L_r + \gamma L_p L_{V^{*}}$. Thus, we get Equation~\ref{eq::L_V}.
\end{proof}
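
\begin{example}
As a numerical illustration with hypothetical constants, take $L_r = 1$ and $\gamma L_p = 0.5$.
The recursion gives $L_1 = 1$, $L_2 = 1.5$, $L_3 = 1.75$, and in general $L_n = 2\,(1 - 0.5^{n})$, so that
\begin{equation*}
\lim_{n \rightarrow \infty} L_n = \frac{L_r}{1 - \gamma L_p} = 2,
\end{equation*}
which indeed satisfies the fixed-point relation $L = L_r + \gamma L_p L$, since $2 = 1 + 0.5 \cdot 2$.
\end{example}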

%\begin{proposition}
%In an OAMDP with
%$\tau^{s, a, s'}(\theta) > 0 $ for $\forall \theta \in \Theta$, $s, s' \in S$, and $a \in A$, 
%belief transitions are Lipschitz continuous.
%\end{proposition}
\proptau*

\begin{proof}
Let $f^{s, a, s'} : \Delta^{|\Theta|} \rightarrow \Delta^{|\Theta|}$, with $f^{s, a, s'}(b) = b^{s, a, s'}$, be the belief transition after observing $\langle s, a, s' \rangle$. 
From the definition (Equation~\ref{eq::belief_update}),
$f^{s, a, s'}(b)(\theta_i) = \frac{\tau^{s, a, s'}_i b_i}{\sum_k \tau^{s, a, s'}_k b_k}$, where $\tau^{s,a,s'}_i = \tau^{s,a,s'}(\theta_i)$ and $b_i = b(\theta_i)$.
Then we have:
\begin{align}
J_{f^{s, a, s'}}(b)_{i, j} &= \begin{cases}
    \frac{\tau^{s, a, s'}_i (\sum_{k \neq i} \tau^{s, a, s'}_k b_k)}{(\sum_{k} \tau^{s, a, s'}_k b_k)^2} & i = j,\\
\frac{-\tau^{s, a, s'}_i \tau^{s, a, s'}_j b_j}{(\sum_{k} \tau^{s, a, s'}_k b_k)^2} & i \neq j,
\end{cases}\\
\|J_{f^{s, a, s'}}(b)\|_{\infty} &= \max_{1 \leq i \leq n} \sum_{1 \leq j \leq n} |J_{f^{s, a, s'}}(b)_{i,j}|,\\
&= \max_{1 \leq i \leq n} \frac{2 \tau^{s, a, s'}_i(\sum_{k \neq i} \tau^{s, a, s'}_k b_k) }{(\sum_k \tau^{s, a, s'}_k b_k)^2},
\end{align}
where $J_f$ is the Jacobian of $f$ and $\|\cdot\|_{\infty}$ is the induced operator norm.
%Since the OAMDP is well-behaved,
%we have $\tau_k > 0$ for all $k=1,\cdots,n$.
Let 
$\tau_{\min} = \min_{s,a,s',k} \tau_k^{s,a,s'}$ 
and
$\tau_{\max} = \max_{s,a,s',k} \tau_k^{s,a,s'}$.
Note that,
for every $b \in \Delta^n$, 
$\sum_{k\neq i} \tau^{s, a, s'}_k b_k \leq \tau_{\max}$ and 
$\sum_{k} \tau^{s, a, s'}_k b_k \geq \tau_{\min}> 0$. Then we get $\|J_{f^{s, a, s'}}(b)\|_{\infty} \leq 2 (\frac{\tau_{\max}}{\tau_{\min}})^2$.
\end{proof}

\lemmaonestep*
%\begin{lemma}
%For an OAMDP with $L_{V^{*}}$-Lipschitz continuous value function, we have the following bound on one-step approximation errors using a regular grid with resolution $K$:
%\begin{equation}
%\|\mathcal{T}_K V^{*}- V^{*}\|_{\infty} \leq 
%\frac{L_{V^{*}}}{K}
%\end{equation}
%\end{lemma}

\begin{proof}
For all $K \geq 1$, $s \in S$, and $b \in \Delta^{|\Theta|}$,
\begin{align}
&| V^{*}(s,b) - \mathcal{T}_K V^{*}(s, b)| \\
%&=  | V^{*}(s,b) - (\mathcal{A}_K \circ \mathcal{T})(V^{*})(s, b_i)| \text{ (by definition) }\\
&=  \Big| V^{*}(s,b) - \sum_{b_i \in P_K(b)} \lambda_i \mathcal{T}V^{*}(s, b_i)\Big| \text{ (by definition) }\\
&=  \Big| \sum_{b_i \in P_K(b)} \lambda_i (V^{*}(s,b) - V^{*}(s, b_i))\Big| \text{ ($V^{*}$ is a fixpoint of $\mathcal{T}$) } \\
&\leq  \sum_{b_i \in P_K(b)} \lambda_i |V^{*}(s,b) -  V^{*}(s, b_i)| \text{ (triangle inequality) } \\
&\leq  \sum_{b_i \in P_K(b)} \lambda_i L_{V^{*}} \|b - b_i\|_{\infty} \\
&\leq \frac{L_{V^{*}}}{K},
\end{align}
where the second equality also uses $\sum_i \lambda_i = 1$, and the last step uses the fact that every point of a sub-simplex is within $\ell_\infty$-distance $1/K$ of each of its corners.
\end{proof}

\properrorbound*
%\begin{theorem}
%Let $\epsilon_K=(L_r + \gamma L_p L_{V^{*}})\frac{1}{K}$ be the one-step approximation error with resolution $K$, we have:
%\begin{equation}
%\|V^{*} - V_K^{*}\|_{\infty} \leq 
%\frac{\epsilon_K}{(1-\gamma)} 
%\end{equation}
%\end{theorem}

\begin{proof}
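Since $V_K^{*}$ is the fixpoint of $\mathcal{T}_K$ and $\mathcal{T}_K$ is a $\gamma$-contraction, Lemma~\ref{lemma::one_step} gives
\begin{align}
&\|V^{*} - V_K^{*}\|_{\infty} \\
&= \|V^{*} - \mathcal{T}_K V^{*} + \mathcal{T}_K V^{*} - V_K^{*}\|_{\infty} \\
&\leq \|V^{*} - \mathcal{T}_K V^{*}\|_{\infty} + \|\mathcal{T}_K V^{*} - \mathcal{T}_K V_K^{*}\|_{\infty} \\
&\leq \frac{L_{V^{*}}}{K} + \gamma \|V^{*} - V_K^{*}\|_{\infty}.
\end{align}
Rearranging gives $\|V^{*} - V_K^{*}\|_{\infty} \leq \frac{L_{V^{*}}}{(1-\gamma)K}$.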
\end{proof}

%\begin{proof}
%\cite{munosPerformanceBoundsL_p2007}
%\begin{align}
%&||V^{*} - V^{\pi_{n, K}}||_{\infty} \\
%&= ||\mathcal{T}V^{*} -  \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty}\\
%&= ||\mathcal{T}V^{*} - \mathcal{T}^{\pi_{n, K}}V_{n,K} 
% +\mathcal{T}^{\pi_{n, K}}V_{n,K} - \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty} \\
%&= ||\mathcal{T}V^{*} - \mathcal{T}V_{n,K} 
% +\mathcal{T}V_{n,K} - \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty} \\
%&\leq ||\mathcal{T}V^{*} - \mathcal{T}V_{n,K}||_{\infty} 
% + ||\mathcal{T}^{\pi_{n, K}}V_{n,K} - \mathcal{T}^{\pi_{n, K}}V^{\pi_{n, K}}||_{\infty} \\
%&\leq \gamma ||V^{*} - V_{n,K}||_{\infty} 
% + \gamma ||V_{n,K} - V^{\pi_{n, K}}||_{\infty}\\
%&\leq \gamma ||V^{*} - V_{n,K}||_{\infty} 
% + \gamma ||V_{n,K} - V^{*}||_{\infty} + ||V^{*} - V^{\pi_{n, K}}||_{\infty}
%\end{align}
%
%Thus, we get:
%\begin{equation}
%||V^{*} - V^{\pi_{n, K}}||_{\infty} \leq \frac{2 \gamma }{1-\gamma}||V^{*} - V_{n, K}||_{\infty}
%\end{equation}
%Moreover,
%\begin{align}
%&||V^{*} - V_{n+1, K}||_{\infty} \\
%&\leq ||\mathcal{T}V^{*} - \mathcal{T}V_{n, K}||_{\infty}
%+  ||\mathcal{T}V_{n, K} - V_{n+1, K}||_{\infty} \\
%&\leq \gamma ||V^{*} - V_{n, K}|| + \epsilon_K
%\end{align}
%By taking the upper limit, we get:
%\begin{equation}
%\limsup_{n \rightarrow \infty }||V^{*} - V_{n, K}|| \leq \epsilon_K / (1 - \gamma)
%\end{equation}
%\end{proof}

\section{Pseudocode for Grid-LRTDP}
\label{sec::lrtdp_pseudocode}
Algorithm~\ref{algorithm::lrtdp} shows the pseudocode for Grid-LRTDP.
The algorithm operates identically to Grid-RTDP, except that at the end of each trial, the algorithm checks if states visited during the trial can be labeled as solved.
\begin{algorithm}
    \caption{Grid-LRTDP}
    \begin{algorithmic}[1]
    \Function{Grid-LRTDP}{$s_0$, $b_0$, $\epsilon$, $K$}
        \While{$\exists b_i \in P_K(b_0)$ such that $\neg \langle s_0, b_i\rangle.solved$}
            \State \Call{LRTDPTRIAL}{$s_0$, $b_0$, $\epsilon$, $K$}
        \EndWhile
    \EndFunction
    \\
    \Function{LRTDPTRIAL}{$s_0$, $b_0$, $\epsilon$, $K$}
        \State $visited \gets Stack::new()$
        \State $s \gets s_0$
        \State $b \gets b_0$
        \While{episode continues}
            \State \textcolor{red}{sample $b_i \in P_K(b)$ with the weight $\lambda_i$} 
            \State $visited.push(\langle s, b_i \rangle)$ 
            \State $a^{*} \leftarrow  \arg\min_a Q_K(s, b_i, a)$ 
            \State $V_K(s, b_i) \gets Q_K(s, b_i, a^{*})$ 
            \State $s' \sim \Pr(\cdot|s, a^{*})$ 
            \State $b \gets b_i^{s, a^{*}, s'}$ 
            \State $s \gets s'$
        \EndWhile
        \\
        \While{$\neg visited.is\_empty()$}
            \State $\langle s, b \rangle \leftarrow visited.pop() $
            \If{$\neg$ CHECKSOLVED($s$, $b$, $\epsilon$, $K$)}
                \State \textbf{break}
            \EndIf
        \EndWhile
    \EndFunction
    \end{algorithmic}
    \label{algorithm::lrtdp}
\end{algorithm}

Algorithm~\ref{algorithm::check_solved} shows the procedure for labeling states.
Starting from a given $\langle s, b \rangle$, the algorithm visits the states that are reachable under the current greedy policy and checks whether the residuals of their Bellman updates are smaller than a given threshold $\epsilon$.
\begin{algorithm}
    \caption{CHECKSOLVED}
    \label{algorithm::check_solved}
    \begin{algorithmic}[1]
    \Function{CHECKSOLVED}{$s$, $b$, $\epsilon$, $K$}
        \State $rv \gets true$
        \State $open \gets Stack::new()$
        \State $closed \gets Stack::new()$
        \If{$\neg \langle s, b \rangle.solved$}
            \State $open.push(\langle s, b \rangle)$
        \EndIf
        \While{$\neg open.is\_empty()$}
            \State $\langle s, b \rangle \leftarrow open.pop()$
            \State $closed.push(\langle s, b \rangle)$
            \State $a^{*} \leftarrow  \arg\min_a Q_K(s, b, a)$ 
            \State $\epsilon_{res} \leftarrow |V_K(s, b) - Q_K(s, b, a^{*})|$
            \State $V_K(s, b) \gets Q_K(s, b, a^{*})$ \label{line::value_update}
            \If{$\epsilon_{res} > \epsilon$}
                \State $rv \gets false$
                \State \textbf{continue}
            \EndIf
            \ForAll{$s' \in S$ such that $T(s, a^{*}, s') > 0$}
                \ForAll{$b_i \in P_K(b^{s, a^{*}, s'})$ such that $\lambda_i > 0$}
                    \If{$\neg \langle s', b_i\rangle.solved \wedge \langle s', b_i\rangle \notin open \wedge \langle s', b_i\rangle \notin closed$}
                        \State $open.push(\langle s', b_i \rangle)$
                    \EndIf
                \EndFor
            \EndFor
        \EndWhile
        \\
        \If{$rv = true$}
            \ForAll{$\langle s, b \rangle \in closed$}
                \State $\langle s, b \rangle.solved \leftarrow true$
            \EndFor
        \Else
            \While{$\neg closed.is\_empty()$}
                \State $\langle s, b \rangle \leftarrow closed.pop()$
                \State $a^{*} \leftarrow  \arg\min_a Q_K(s, b, a)$ 
                \State $V_K(s, b) \gets Q_K(s, b, a^{*})$
            \EndWhile
        \EndIf
        \State \Return $rv$
    \EndFunction
    \end{algorithmic}
\end{algorithm}

\end{document}