%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
\usepackage{bibentry}
\newtheorem{definition}{Definition}
\newtheorem{Lemma}{Lemma}
\newtheorem{Assumption}{Assumption}
\newtheorem{example}{Example}
\newtheorem{corollary}{Corollary}
\newtheorem{Problem}{Problem}
\newtheorem{theorem}{Theorem}
\newtheorem{proof}{Proof}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{proposition}{Proposition}
% \usepackage{algorithm}
\usepackage[noend]{algorithmic}
\usepackage{newfloat}
\usepackage{listings}
\usepackage{amssymb}
\usepackage{wrapfig}
\usepackage[ruled,vlined]{algorithm2e}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Conditional Abstraction Trees for Sample-Efficient Reinforcement Learning}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Mehdi Dadvar}
\author[1]{Rashmeet Kaur Nayyar}
\author[1]{Siddharth Srivastava}
% Add affiliations after the authors
\affil[1]{%
    Arizona State University\\
    Tempe, Arizona, USA
}


  \begin{document}
\maketitle


\newcommand{\alg}{CAT+RL}

\begin{abstract}
In many real-world problems, the learning agent needs to learn a problem’s abstractions and solution simultaneously. However, most such abstractions need to be designed and refined by hand for different problems and domains of application. This paper presents a novel top-down approach for constructing state abstractions while carrying out reinforcement learning (RL). Starting with state variables and a simulator, it presents a domain-independent approach for dynamically computing an abstraction based on the dispersion of temporal difference errors in abstract states as the agent continues acting and learning. Extensive empirical evaluation on multiple domains and problems shows that this approach automatically learns semantically rich abstractions that are finely tuned to the problem, yield strong sample efficiency, and result in the RL agent significantly outperforming existing approaches.
\end{abstract}

\section{Introduction}
It is well known that \emph{good abstract representations} can play a vital role in improving the scalability and efficiency of reinforcement learning (RL)~\citep{sutton2018reinforcement,yu2018towards,konidaris2019necessity}. However, it is not clear how good abstract representations can be learned efficiently without extensive hand-coding. Several approaches~\citep{kocsis2006bandit, anand2015asap, jiang2014improving} have investigated methods for aggregating concrete states based on similarities in value functions, but these approaches can be difficult to scale as the number of concrete states or the transition graph grows.

This paper presents a novel approach for top-down construction and refinement of abstractions for sample-efficient reinforcement learning in factored, non-image-based domains. Such problems include several practical applications (e.g., a taxi-management service), where the state is naturally expressed in terms of values of different variables. Translating such states into images would require extensive human effort.
Our approach starts with a default, auto-generated coarse abstraction that collapses the domain of each state variable (e.g., the location of each taxi and each passenger in the classic taxi world)  to one or two abstract values. This eliminates the need to consider concrete states individually, although this initial abstraction is likely to be too coarse for most practical problems. The overall algorithm proceeds by interleaving the process of refining this abstraction with learning and evaluation of policies, and results in a new form of conditional abstraction that is automatically generated and changes based on the current state to aid learning.

Extensive empirical evaluation on a range of well-established, challenging discrete and continuous problems drawn from state-of-the-art RL research~\citep{icarte2018using, abel2020value, jin2022creativity, barreto2020fast} shows that \emph{this approach for learning conditional abstractions enables vanilla Q-learning to outperform state-of-the-art baselines} by significantly improving its sample efficiency. In the process, it also learns well-defined abstract representations and draws out similarities across the state space. Furthermore, we found that this approach requires significantly less hyperparameter tuning in comparison to many of the baselines.

Our approach is related to research on variable resolution abstractions for reinforcement learning and abstraction refinement in model checking~\citep{moore1991variable,clarke2000counterexample,dams2018abstraction}. 
However, unlike existing streams of work, we develop a process that automatically generates semantically rich conditional abstractions, where the final abstraction on the set of values of a variable can depend on the specific values of other variables. For instance, consider a taxi-world problem (Fig.\,\ref{fig:grid-mdp-taxi}). Ideally, when the taxi needs to pick up the orange passenger, a good abstraction would preserve precision in regions closer to the passenger and blur out states where the taxi has a similar policy (Fig.\,\ref{fig:grid-mdp-taxi} (middle)). However, when the passenger is in the taxi, the abstraction should \emph{change} to increase precision around the destination to the extent required to express the taxi's policy for dropping off the passenger (Fig.\,\ref{fig:grid-mdp-taxi} (right)). In other words, the abstraction on a variable's values (such as the taxi's location) needs to be contingent on the values of the other variables (such as the passenger's presence in the taxi).

To our knowledge, this constitutes the first model-free approach for learning such conditional abstractions on-the-fly while carrying out abstract RL.
Our key contributions are (a) formalization and algorithms for building well-defined conditional abstraction trees (CATs) that help compute and represent such abstractions, as well as (b) an algorithm for interleaving RL with CAT learning. While CAT learning could be utilized in numerous RL paradigms, this paper focuses on developing and investigating it for non-image-based domains with discrete actions.

This process also addresses a key challenge in planning with abstractions: it is well known that abstractions of Markovian transition systems such as MDPs are often non-Markovian~\citep{singh1994reinforcement,bai2016markovian,srivastava2016metaphysics}. Intuitively, this is related to the fact that different concrete states represented by an abstract state will have different optimal actions and Q functions. More precisely, the next abstract state depends in general on the agent's current concrete state, whose distribution can depend on the entire action history rather than on only the current abstract state and the current action. To address these problems, \alg{} carries out RL in the abstract state space, but when it observes a high dispersion of temporal difference (TD) errors during Q-learning, it selectively refines the abstraction, thereby reducing the extent of relevant non-Markovian transitions in the abstract state space. In the worst case, this process can lead to a full concretization for discrete state spaces, but substantial information is carried over across refinements and the approach turns out to be highly sample efficient in practice. We leave further analysis of this aspect for future work and focus on the core \alg{} algorithm in this paper.

The presented approach for Conditional Abstraction Trees for RL (\alg) can be thought of as a dynamic abstraction
scheme:
% because the refinement is tied to the dispersion of temporal difference (TD) errors based on the agent's evolving policy during learning. 
it provides adjustable degrees of compression \citep{abel2016near} where the aggressiveness of abstraction can be controlled by tuning the definition of variation in the dispersion of TD errors. 


\begin{figure}[t]
\centering
\includegraphics[scale=0.13]{images/intuition}
\caption{\small Consider a classic taxi world with two passengers and a building as the drop-off location where the green area is impassable (left). Meaningful conditional abstractions can be constructed, for example, for situations where both passengers are at their pickup locations (middle), or one passenger has already been picked up (right).}
\label{fig:grid-mdp-taxi}
\end{figure}
   

\section{Related Work}
\label{sec:related-work}
\textbf{Abstraction Refinement}
Several authors have considered variable resolution abstractions and abstraction refinement for RL \citep[e.g.,][]{moore1991variable, uther1998tree, whiteson2010adaptive}. Later work by \citet{seipp2018counterexample} developed the concept for classical planning. However, it has remained unclear how to formalize and develop this principle in a manner that provides scalability and sample efficiency in stochastic settings. For instance, \citet{uther1998tree} employ decision-tree techniques to categorize concrete transition histories rather than creating abstract states. As a consequence, this approach requires a large number of concrete samples for finding a \textit{good} split using multiple sort and search operations on concrete transitions. \citet{whiteson2010adaptive} used the variation in state values as a split metric for tile-based representations of abstract states. This approach requires a deterministic model of the world and needs to keep track of the Q-values of all possible refinements of an abstract state. Additionally, the non-exclusivity of sub-tiles considered during refinement leads to additional computation for sub-tiles that may not be used.

The approach presented in this paper addresses these longstanding problems and develops a well-defined formalization that enables dynamic, variable-resolution abstractions for RL. It achieves this by developing the CAT data structure to keep track of heterogeneous abstractions and uses the CAT to define a purely abstract RL process that runs in concert with dispersion-guided abstraction refinement for stochastic settings.
CATs enable \alg{} to identify abstract states with the greatest TD dispersion and provide the useful property that all children of an abstract state in the CAT are mutually exclusive and exhaustive. This makes \alg{}'s RL process more efficient and scalable. 

\textbf{Offline State Abstraction} Most early studies focus on action-specific \citep{dietterich1999state} and option-specific \citep{jonsson2000automated} state abstraction. Further, \citet{givan2003equivalence} introduced the notion of state equivalence to reduce the state space size: two states can be aggregated into one abstract state if applying the same action in both leads to equivalent states with similar rewards. \citet{ravindran2004approximate} relaxed this definition of state equivalence by allowing the actions to be different if there is a valid mapping between them. Offline state abstraction has further been studied for generalization and transfer in RL \citep{karia2022relational} and planning \citep{srivastava2012applicability,karia2022learning}.

\textbf{Graph-Theoretic State Abstraction} \citet{mannor2004dynamic} developed a graph-theoretic state abstraction approach that utilizes the topological similarities of a state transition graph (STG) to aggregate states in an online manner. Mannor's definition of state abstraction follows Givan's notion of equivalent states, except that the partial STG is updated iteratively to find the abstractions. Another comparable method by \citet{chiu2010automatic} carries out spectral graph analysis on the STG to decompose the graph into multiple sub-graphs. However, most graph-theoretic analyses on the STG, such as computing the eigenvectors in \citeauthor{chiu2010automatic}'s work, can become infeasible for problems with large state spaces.

\textbf{Monte-Carlo Tree Search (MCTS)} MCTS approaches offer viable and tractable algorithms for large state-space Markovian decision problems \citep{kocsis2006bandit}. \citet{jiang2014improving} demonstrated that proper abstraction effectively enhances the performance of MCTS algorithms. However, their clustering-based state abstraction approach is limited to the states enumerated by their algorithm within the partially expanded tree, which makes it ineffectual when limited samples are available to the planning/learning agent. \citet{anand2015asap} advanced Jiang's method by comprehensively aggregating states and state-action pairs aiming to uncover more symmetries in the domain. Owing to their novel state-action pair abstraction extending Givan and Ravindran's notions of abstractions, \citeauthor{anand2015asap}'s method results in higher quality policies compared to other approaches based on MCTS. However, their bottom-up abstraction scheme makes their method computationally vulnerable to problems with significantly larger state space size. Moreover, their proposed state abstraction method is limited to the explored states since it applies to the partially expanded tree.


\section{Background}
\label{sec:back}
Markov decision processes (MDPs)~\citep{bellman1957markovian,puterman2014markov} are defined as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma\rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively. In this paper, we focus on problems where a concrete state $s \in \mathcal{S}$ is defined as an assignment of values to a set of $n$ state variables $\mathcal{V} = \{ v_i | i = 1,\dots,n \}$. $\mathcal{T}: \mathcal{S}\times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ is a transition probability function, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a reward function, and $\gamma$ is the discount factor. A policy $\pi$ is a solution to an MDP, denoted as $\pi: \mathcal{S} \rightarrow \mathcal{A}$. We consider the RL setting, where an agent needs to interact with an environment that can be modeled as an MDP with unknown $\mathcal{T}$. The objective is to learn an optimal policy that maximizes the long-term cumulative reward for this MDP.

When the size of the state space increases significantly, most RL algorithms fail to solve the given MDP due to the \textit{curse of dimensionality}. Abstraction is a dimension reduction mechanism by which the original problem representation maps to a new reduced problem representation \citep{giunchiglia1992theory}. We adopt the general definition of state abstraction proposed by \citet{li2006towards}.

\begin{definition} 
\label{def:back}
Let $M = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma\rangle$ be a ground MDP from which an abstract MDP $\bar{M} = \langle \bar{\mathcal{S}}, \mathcal{A}, \bar{\mathcal{T}}, \bar{\mathcal{R}}, \gamma\rangle$ can be derived via a state abstraction function $\phi: \mathcal{S} \rightarrow \bar{\mathcal{S}}$, where the abstract state to which a concrete state $s$ is mapped is denoted as $\phi(s) \in \bar{\mathcal{S}}$ and $\phi^{-1}(\bar{s})$ is the set of concrete states associated with $\bar{s} \in \bar{\mathcal{S}}$. Further, a weighting function over concrete states is denoted as $w(s)$ with $s \in \mathcal{S}$ s.t. for each $\bar{s} \in \bar{\mathcal{S}}$, $\sum_{s \in \phi^{-1}(\bar{s})} w(s) = 1$, where $w(s) \in [0,1]$. Accordingly, the abstract transition probability function $\bar{\mathcal{T}}$ and reward function $\bar{\mathcal{R}}$ are defined as follows:
\begin{align}
    \bar{\mathcal{R}} (\bar{s},a) &= \sum_{s \in \phi^{-1}(\bar{s})} w(s)\mathcal{R}(s,a), 
    \nonumber \\ 
\bar{\mathcal{T}}(\bar{s}, a, \bar{s}') &= \sum_{s \in \phi^{-1}(\bar{s})} \sum_{s' \in \phi^{-1}(\bar{s}')} w(s)\mathcal{T}(s,a,s'). \nonumber
\end{align}
\end{definition}

In this work, we consider a uniform weighting function, i.e., $w(s) = 1/\lvert \phi^{-1}(\phi(s)) \rvert$ for all concrete states, so that the weights within each abstract state sum to one.
When it comes to decision-making in an abstract MDP, all concrete states associated with an abstract state $\bar{s} \in \bar{\mathcal{S}}$ are perceived identically. Accordingly, the relation between the abstract policy $\bar{\pi}: \bar{\mathcal{S}} \rightarrow \mathcal{A}$ and the concrete policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ can be defined as $\pi(s) = \bar{\pi} (\phi(s))$ for all $s \in \mathcal{S}$. Further, the value functions for an abstract MDP are denoted as $V^{\bar{\pi}}(\bar{s})$, $V^*(\bar{s})$, $Q^{\bar{\pi}}(\bar{s},a)$, and $Q^*(\bar{s},a)$.
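
As a small illustration of Definition~\ref{def:back}, consider a hypothetical abstract state $\bar{s}$ with $\phi^{-1}(\bar{s}) = \{s_1, s_2\}$, and assume weights that are uniform within $\bar{s}$ and sum to one, so that $w(s_1) = w(s_2) = 1/2$. The reward values below are invented purely for illustration:
\begin{align}
\bar{\mathcal{R}}(\bar{s},a) &= w(s_1)\mathcal{R}(s_1,a) + w(s_2)\mathcal{R}(s_2,a) \nonumber \\
&= \tfrac{1}{2}(-1) + \tfrac{1}{2}(-10) = -5.5. \nonumber
\end{align}
The abstract reward averages over concrete states that may differ substantially, which is one way a coarse abstraction can hide distinctions that matter for the policy.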


\section{Our Approach}
\label{sec:dynamic-abstraction}

\subsection{Overview}
Starting with state variables and a simulator, we develop a domain-independent approach for dynamically computing an abstraction based on the dispersion of TD errors in abstract states. The idea of dynamic abstraction is to learn a problem's solution and abstractions simultaneously. We propose a top-down abstraction refinement mechanism by which the learning agent effectively refines an initial coarse abstraction through acting and learning. We illustrate this mechanism with an example.

\begin{figure}
\centering
\includegraphics[scale=0.16]{images/example}
\caption{\small An example of dynamic heterogeneous abstraction refinement for a Wumpus world.}
\label{fig:grid-mdp}
\end{figure}

\begin{example}
\label{exmp:grid}
Consider a $4 \times 4$ Wumpus world consisting of a pit at (2,2) and a goal at (4,4). In this domain, every movement has a reward of $-1$. Reaching the goal results in a positive reward of $10$, and the agent receives a negative reward of $-10$ for falling into the pit. The goal and the pit are the terminal states of the domain. The agent's actions include moving to non-diagonal adjacent cells at each time step s.t. $\mathcal{A}$ = \{up, down, left, right\}.
\end{example}


Considering Example \ref{exmp:grid}, Fig. \ref{fig:grid-mdp} (left) shows a potential initial coarse abstraction in which the domain of each state variable (here the $x$ and $y$ coordinates) is split into two abstract values, and $\bar{S}_1$ and $\bar{S}_4$ contain the pit and the goal location respectively. As a result, when learning, the agent will observe a high standard deviation of the TD errors of $(\bar{S}_1,right), (\bar{S}_1,down), (\bar{S}_4,right),$ and $(\bar{S}_4,down)$ because of the presence of terminal states with large negative or positive rewards. Guided by this dispersion of TD errors, the initial coarse abstraction should be refined to resolve the observed variations. Fig. \ref{fig:grid-mdp} (right) exemplifies an effective abstraction refinement for Example~\ref{exmp:grid}, shown as a heatmap of TD errors. Notice that the desired abstraction is a heterogeneous abstraction on the domains of state variable values, where the abstraction on a variable depends on the values of the other variables: let the domains of $x$ and $y$ be $\{ 1, 2, 3, 4 \}$. When $y > 2$, the domain of $x$ (originally $\{ 1, 2, 3, 4 \}$) is abstracted into sets $\{ 1,2 \}$, $\{ 3\}$, and $\{ 4 \}$, but when $y \leq 2$, the domain of $x$ is abstracted into sets $\{ 1 \}$, $\{ 2 \}$, and $\{ 3,4 \}$.
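
The conditional structure described above can be written compactly as a piecewise map. The notation $\phi_x(x \mid y)$ is introduced here only for illustration; it denotes the abstract value assigned to $x$ given the value of $y$:
\[
\phi_x(x \mid y) =
\begin{cases}
\{1,2\} & \text{if } x \in \{1,2\} \text{ and } y > 2, \\
\{3\} & \text{if } x = 3 \text{ and } y > 2, \\
\{4\} & \text{if } x = 4 \text{ and } y > 2, \\
\{1\} & \text{if } x = 1 \text{ and } y \leq 2, \\
\{2\} & \text{if } x = 2 \text{ and } y \leq 2, \\
\{3,4\} & \text{if } x \in \{3,4\} \text{ and } y \leq 2.
\end{cases}
\]
No single partition of the domain of $x$ reproduces both halves of this map, which is why the abstraction has to be conditioned on $y$.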


\subsection{Conditional Abstraction Trees}
\label{sec:hat}

Recall that the value of a state variable $v_i$ inherently falls within a known domain. Partitioning these domains is one possible way to construct state abstractions. In a conditional abstraction, the partition of one state variable's domain can depend on the ranges of the other state variables. Accordingly, we maintain and update such conditional abstractions via structures that we call Conditional Abstraction Trees (CATs).

\begin{figure}
\centering
\includegraphics[scale=0.32]{images/hat} 
\caption{\small This figure illustrates a Conditional Abstraction Tree (CAT) for Example \ref{exmp:grid}. Ranges written inside the nodes represent $\theta_i \in \Theta$. Each node represents a conditional abstraction.}
\label{fig:hat}
\end{figure}

Fig. \ref{fig:hat} (right) exemplifies a partially expanded CAT for the problem in Example \ref{exmp:grid}. The tree's root node contains the global ranges (the first range refers to the horizontal location $x$ and the second range refers to the vertical location $y$) for both of these state variables representing an initial coarse abstraction (in white). The annotations visualize how this initial abstraction can be further refined w.r.t. a state variable resulting in new conditional abstractions (repetitive annotations are not shown for the sake of readability). The refinement procedure of the Wumpus world associated with each level of the tree is also displayed in Fig. \ref{fig:hat} (left).

Given the set of state variables $\mathcal{V}$, we define an abstract state using a set of partitions, one for each variable $v_i$, where each partition $\theta_i$ is an interval of the form $[l_i, h_i]$. For instance, a coarse abstract state for Example \ref{exmp:grid} could be defined by $\theta_1 = [1,4]$ and $\theta_2 = [1,2]$. An abstraction is defined as $\Theta = \{\theta_i | i = 1,\dots,n\}$, where $n = \lvert \mathcal{V} \rvert$. A CAT is a hierarchical abstraction tree starting with an initial abstraction $\Theta_{init}$ that represents the original range of each state variable $v_i \in \mathcal{V}$ s.t. $\Theta_{init} = \{ \theta_i | i = 1,\dots,n \textrm{ and } l_i = v_i^{min} \textrm{ and } h_i = v_i^{max} \}$, where $v_i^{min}$ and $v_i^{max}$ denote the lower and upper bounds on the range of $v_i$ respectively. In Example \ref{exmp:grid}, there are two state variables, so the initial abstraction is $\Theta_{init} = \{ [1,4],[1,4] \}$. The initial abstraction also induces the starting coarse abstraction, since it compresses all values of all state variables into one abstract state.

This initial coarse abstraction induced by $\Theta_{init}$ needs to be further refined so that the learning agent can improve its performance through a finer representation. Let $\Theta$ be an abstraction. We use a refinement function $\delta(\Theta,i,f)$ that splits the range of partition $\theta_i \in \Theta$ of state variable $v_i$ into $f$ equal ranges, resulting in $f$ new abstractions. It is formally defined as follows.

\begin{definition} 
\label{def:refinement}
Let $\Theta = \langle \theta_1, \ldots, \theta_n \rangle$ be an abstract state for a domain with variables $v_1, \ldots , v_n$. We define the $f$-split refinement of $\Theta$ w.r.t. variable $v_i$ as $\delta(\Theta, i, f) = \{ \Theta^1, \ldots, \Theta^f\}$, where every $\Theta^j$ is the same as $\Theta$ on every $\theta_k$ for $k\ne i$, and $\theta_i = [l, h]$ is partitioned by the boundaries $l, l_1, l_2,\ldots, l_{f-1}, h$, with $l_x = l + x\times \lfloor (h-l)/f \rfloor$, so that consecutive boundaries are approximately $\lvert \theta_i \rvert/f$ values apart.
\end{definition}
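
As a worked instance of Definition~\ref{def:refinement}, consider refining the initial abstraction of Example~\ref{exmp:grid} w.r.t. the first variable with $f = 2$. Here $\theta_1 = [1,4]$, so $l = 1$ and $h = 4$, and the single interior boundary is
\begin{align}
l_1 &= 1 + 1 \times \lfloor (4-1)/2 \rfloor = 2. \nonumber
\end{align}
Under the assumption (made here for integer-valued variables) that each boundary $l_x$ closes the $x$-th interval and the next interval starts at $l_x + 1$, this yields
\begin{align}
\delta(\Theta_{init}, 1, 2) &= \{ \Theta^1, \Theta^2 \}, \nonumber \\
\Theta^1 &= \langle [1,2], [1,4] \rangle, \nonumber \\
\Theta^2 &= \langle [3,4], [1,4] \rangle. \nonumber
\end{align}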

Next, we need to define the relation between two given abstractions in the form of $\Theta$ in order to determine if one is obtained by refining the other.    

\begin{definition}
Let $\Psi$ be the set containing all possible abstractions. Given $\Theta_a,\Theta_b \in \Psi$, we say $\Theta_b$ is obtained by refining $\Theta_a$, denoted as $\Theta_b \triangleright \Theta_a$, iff $(\forall i \in[1,n]) (\theta_i^b \subseteq \theta_i^a)$. Moreover, $\Theta_b \triangleright \Theta_a \equiv \Theta_a \triangleleft \Theta_b$. Although this definition determines an ancestral relation between $\Theta_a$ and $\Theta_b$, we need to know the factor $f$ by which $\Theta_a$ has been refined to determine if $\Theta_b$ is the direct result of refining $\Theta_a$. We say $\Theta_b$ is obtained directly by refining $\Theta_a$, denoted as $\Theta_b \trianglerighteq \Theta_a$, iff there exists $i \in [1,n]$ such that $\theta_i^b \subset \theta_i^a$, $\theta_k^b = \theta_k^a$ for all $k \in [1,n]$ with $k \neq i$, and $\lvert \theta_i^b \rvert \times f = \lvert \theta_i^a \rvert$.
\end{definition}
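
To illustrate the two relations, consider the following three abstractions for Example~\ref{exmp:grid}, and assume for this illustration that $\lvert \theta \rvert$ counts the integer values in an interval, so that $\lvert [1,4] \rvert = 4$ and $\lvert [1,2] \rvert = 2$:
\begin{align}
\Theta_a &= \langle [1,4], [1,4] \rangle, \nonumber \\
\Theta_b &= \langle [1,2], [1,4] \rangle, \nonumber \\
\Theta_c &= \langle [1,2], [3,4] \rangle. \nonumber
\end{align}
Both $\Theta_b \triangleright \Theta_a$ and $\Theta_c \triangleright \Theta_a$ hold, since every interval of $\Theta_b$ and $\Theta_c$ is contained in the corresponding interval of $\Theta_a$. With $f = 2$, $\Theta_b \trianglerighteq \Theta_a$ also holds, because only the first interval differs and $\lvert [1,2] \rvert \times 2 = \lvert [1,4] \rvert$. In contrast, $\Theta_c$ is not a direct refinement of $\Theta_a$, because it differs from $\Theta_a$ in both intervals; it is, however, a direct refinement of $\Theta_b$.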

With these definitions in hand, we can now formally define CAT as a tree to construct and maintain the hierarchy of conditional partitions. A CAT, denoted as $\xi$, represents a tree structure specifying the hierarchy of conditional abstractions in the form of $\Theta$. 

\begin{definition}
\label{def:hat}
A conditional abstraction tree (CAT) is defined as $\xi = \{N, E\}$, where $N$ is the set of nodes and $E$ is the set of edges. Each node in $N$ corresponds to an abstraction $\Theta$, s.t. $N = \{ \Theta_m | m \in [1,n_{\xi}] \}$, where $n_{\xi}$ is the number of nodes in the CAT and the root node of the tree is the initial abstraction $\Theta_{init}$. Every parent node $\Theta_p$ and child node $\Theta_c$ in $\xi$ are connected via an edge $e_p^c$ s.t. $e_p^c \implies \Theta_c \trianglerighteq \Theta_p$. $L_{\xi} = \{ \Theta_m | (\forall k \in [1,n_{\xi}]) ( \Theta_k \ntrianglerighteq \Theta_m) \}$ is defined as the set of leaf nodes representing the set of abstract states.
\end{definition}

\begin{algorithm}[t]
% \begin{algorithm}[tb]
\caption{State Abstraction}
\label{alg:hat-search}
\textbf{FindAbstract} (CAT $\xi$, $\Theta_{start}$, $s$):
\vspace{-0.9em}
\begin{algorithmic}[1] %[1] enables line numbers
\IF {$(\forall v_i \in s) (v_i \in \theta_i^{start})$} \label{l:inclusion}
\IF {$\Theta_{start} \in L_{\xi}$ }
\STATE \textbf{return} $\Theta_{start}$
\ELSE
\STATE $children \leftarrow Children(\Theta_{start})$
\FOR{$\Theta_{child} \in children$}
\IF {$(\forall v_i \in s) (v_i \in \theta_i^{child})$} \label{l:inclusion_children}
\STATE \textbf{return} \texttt{FindAbstract} ($\xi$, $\Theta_{child}$, $s$) \label{l:recursion}
\ENDIF
\ENDFOR
\ENDIF
\ENDIF
\end{algorithmic}
\end{algorithm}

Given a CAT $\xi$ and a concrete state $s$, the mapping $\phi(s): \mathcal{S} \rightarrow \bar{\mathcal{S}}$ can be done via a level-order tree search starting from $\Theta_{init}$. The corresponding abstract state $\bar{s}$ is the node $\Theta_{\emph{found}}$ \textit{iff} $\forall i \in [1,n] \; v_i \in \theta_i^{\emph{found}}$ (inclusion condition) and $\Theta_{\emph{found}}$ is a leaf node, i.e., $\Theta_{\emph{found}} \in L_{\xi}$. Alg. \ref{alg:hat-search} computes the $\phi: \mathcal{S} \rightarrow \bar{\mathcal{S}}$ mapping for a given concrete state $s$ under CAT $\xi$, starting from the CAT's root node $\Theta_{init}$. $\texttt{FindAbstract}(\xi, \Theta_{start}, s)$ starts the level-order search from $\Theta_{start}$, and it always finds the corresponding abstract state when $\Theta_{start} = \Theta_{init}$. The algorithm first checks the inclusion condition for $\Theta_{start}$ (Line \ref{l:inclusion} in Alg. \ref{alg:hat-search}). If $\Theta_{start}$ is not a leaf node, the algorithm checks the inclusion condition for the children of $\Theta_{start}$ (Line \ref{l:inclusion_children} in Alg. \ref{alg:hat-search}), and if a child satisfies the condition, $\texttt{FindAbstract}$ is invoked recursively on it (Line \ref{l:recursion} in Alg. \ref{alg:hat-search}).
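
To illustrate Alg.~\ref{alg:hat-search}, consider a hypothetical CAT for Example~\ref{exmp:grid} (not necessarily the one shown in Fig.~\ref{fig:hat}) whose root $\Theta_{init} = \langle [1,4],[1,4] \rangle$ has children $\Theta_A = \langle [1,2],[1,4] \rangle$ and $\Theta_B = \langle [3,4],[1,4] \rangle$, where $\Theta_B$ in turn has the leaf children $\Theta_{B1} = \langle [3,4],[1,2] \rangle$ and $\Theta_{B2} = \langle [3,4],[3,4] \rangle$. The node names are introduced here only for this sketch. For the concrete state $s = (3,1)$, the search would follow the path
\[
\Theta_{init} \rightarrow \Theta_B \rightarrow \Theta_{B1}.
\]
The root satisfies the inclusion condition but is not a leaf. Among its children, $\Theta_A$ is skipped because $3 \notin [1,2]$, whereas $\Theta_B$ satisfies the condition and is searched recursively. Of the children of $\Theta_B$, only $\Theta_{B1}$ contains $y = 1$, and since it is a leaf, $\phi(s) = \Theta_{B1}$. Because the children of a node are mutually exclusive, at most one child is followed at each level, so the lookup visits a single root-to-leaf path.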

Any state abstraction under a given CAT $\xi$ induces an abstract representation of the underlying concrete MDP $M$. Thus, an MDP $M$ can have two abstract representations $\bar{M}_a$ and $\bar{M}_b$ under two CATs $\xi_a$ and $\xi_b$ respectively. We define a relational operation to decide which abstract MDP is finer.
\vspace{-0.5em}
\begin{definition}
Given MDPs $\bar{M}_a$ and $\bar{M}_b$ abstracted under $\xi_a$ and $\xi_b$, we say $\bar{M}_a$ is strictly finer than $\bar{M}_b$, denoted as $\bar{M}_a \succ \bar{M}_b$, iff $\forall \Theta^a \in L_{\xi_a} \; \exists \Theta^b \in L_{\xi_b} \; (\Theta^a \trianglerighteq \Theta^b)$. We also say $\bar{M}_a$ is finer than $\bar{M}_b$, denoted as $\bar{M}_a \succeq \bar{M}_b$, iff $\forall \Theta^a \in L_{\xi_a} \; \exists \Theta^b \in L_{\xi_b} \; (\Theta^a \trianglerighteq \Theta^b \vee \Theta^a = \Theta^b)$.
\end{definition}
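
As a minimal illustration for Example~\ref{exmp:grid}, suppose that $\xi_b$ consists of the root only and that $\xi_a$ is obtained from it by one 2-split on the first variable, with the resulting intervals assumed to be $[1,2]$ and $[3,4]$:
\begin{align}
L_{\xi_b} &= \{ \langle [1,4],[1,4] \rangle \}, \nonumber \\
L_{\xi_a} &= \{ \langle [1,2],[1,4] \rangle, \langle [3,4],[1,4] \rangle \}. \nonumber
\end{align}
Every leaf of $\xi_a$ is a direct refinement of the single leaf of $\xi_b$, so $\bar{M}_a \succ \bar{M}_b$ in this case.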


\subsection {Learning Dynamic Abstractions}
\label{sec:learning-dynamic-abs}
Definition \ref{def:hat} formalizes the abstraction tree by which the mapping $\phi(s):\mathcal{S} \rightarrow \bar{\mathcal{S}}$ can be performed using a level-order search (see Alg. \ref{alg:hat-search}), while Definition \ref{def:refinement} explains how a node of a CAT can be refined w.r.t. a state variable $v_i$ through the refinement function $\delta(\Theta, i, f)$. However, our objective is to interleave RL training with phases of abstraction refinement leading to an enhanced abstract policy $\bar{\pi}$ for a given concrete MDP $M$. To this end, \alg{} consists of three phases explained below: 

\emph{Learning phase}. Starting with an initial coarse abstraction, the RL agent interacts with the environment and learns an abstract policy $\bar{\pi}$. The learning phase of \alg{} is a standard RL routine where the agent learns the abstract policy $\bar{\pi}$ through the mapping $\phi(s): \mathcal{S} \rightarrow \bar{\mathcal{S}}$ under a CAT. We employ a vanilla Q-learning algorithm on the abstract state space as the underlying RL algorithm of \alg{}.

\emph{Abstraction evaluation phase}. Since the initial abstraction is likely to be too coarse, \alg{} should refine the CAT $\xi$ to construct a more effective abstraction. To identify the abstract states that need further refinement, \alg{} starts the abstraction evaluation phase to collect samples of TD errors throughout the Q-learning process over abstract states. Thus, in the abstraction evaluation phase, the RL agent continues interacting with the environment via an $\epsilon$-greedy variant of the fixed abstract policy $\bar{\pi}$, and \alg{} evaluates the existing abstraction under the CAT $\xi$ by logging the dispersion of TD errors over abstract states. Let $\beta(M, \xi, \bar{\pi}, n_{\emph{eval}})$ denote the evaluation function, which runs the underlying RL routine for $n_{\emph{eval}}$ episodes with the fixed stochastic abstract policy for a given MDP $M$ and CAT $\xi$. Throughout the abstraction evaluation phase, the observed dispersion of TD errors is defined as $\Gamma = \{ d_m | m \in [1, n_{\emph{visited}}] \}$, where $n_{\emph{visited}}$ is the number of abstract states visited during the abstraction evaluation phase and $d_m$ denotes the set of TD errors logged for the state-action pairs $(\bar{s},a)$ of a visited abstract state $\bar{s}$. When the abstraction evaluation phase is done, $\beta$ returns the dispersion of TD errors in the form of $\Gamma$.
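
For concreteness, the quantity being logged can be written in the standard Q-learning form applied to abstract states. This is a sketch that assumes the vanilla update is used unchanged on $\bar{\mathcal{S}}$; the symbols $e_t$ (TD error) and $\alpha$ (learning rate) are notation introduced here. For a transition $(s_t, a_t, r_t, s_{t+1})$,
\begin{align}
e_t &= r_t + \gamma \max_{a'} Q(\phi(s_{t+1}), a') - Q(\phi(s_t), a_t), \nonumber \\
Q(\phi(s_t), a_t) &\leftarrow Q(\phi(s_t), a_t) + \alpha \, e_t. \nonumber
\end{align}
The second line is the usual update of the learning phase, while during evaluation each $e_t$ would be appended to the log kept for the abstract state $\phi(s_t)$ and action $a_t$.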

\emph{Refinement phase}. Once the abstraction evaluation phase terminates, the dispersion of TD errors $\Gamma$ is available and the refinement phase of \alg{} can be initiated. $\Gamma$ may contain multiple logs of TD errors for the same pair of abstract state and action $(\bar{s}, a)$. Since, throughout the abstraction evaluation phase, the policy is held fixed until the agent leaves an abstract state, a high variation of TD errors for the same pair $(\bar{s}, a)$ indicates that the abstract state $\bar{s}$ is unstable, i.e., it represents significantly disparate concrete states and requires further refinement. Therefore, the first step of the refinement phase is to find the top $k$ unstable states of the CAT $\xi$. Let $\texttt{UnstableStates}(\Gamma)$ denote a function that finds the set of unstable states, each in the form of $\Theta$, based on $\Gamma$. For each visited abstract state in $\Gamma$, \texttt{UnstableStates} calculates the maximum normalized standard deviation of TD errors over all actions. Then, \texttt{UnstableStates} uses the \emph{k-means} clustering technique to find and return the top $k$ unstable states among all of the visited abstract states in $\Gamma$. Each unstable state can be refined by splitting it into $f$ new states w.r.t. a state variable $v_i$, following the definition of $f$-split refinement in Definition \ref{def:refinement}. However, the question is: which state variable should \alg{} blame for the observed instability in an unstable state? As discussed, \alg{} learns an abstract policy $\bar{\pi}$ over abstract states, so it maintains and updates a Q-table for abstract states to find the optimal abstract policy. For problems with a discrete state space, \alg{} can also maintain and update a Q-table for concrete states. This concrete Q-table can be used for various purposes, such as finding the state variables that contribute to the instability of an unstable state.
Let $\texttt{UnstableVar}(\Gamma, \Theta)$ denote a function that selects, for an unstable state in the form of $\Theta$, the state variable whose split results in the most consistent new abstract states. Splitting an abstract state over a state variable results in $f$ new abstract states. For each newly created abstract state, \alg{} calculates the normalized standard deviation of the TD errors. Intuitively, if all concrete states under each of the newly created abstract states have TD errors with a small standard deviation for the same action $a$, then splitting over that state variable is a near-optimal refinement and can potentially reduce or resolve the instability in the abstract state. \texttt{UnstableVar} repeats this process for all state variables and chooses the one that minimizes the normalized standard deviation of the underlying TD errors at the concrete level. In Sec. D of the supplementary document, we also present an alternative approach for \texttt{UnstableVar} that aggressively refines an abstract state w.r.t. all state variables.
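
The two selection steps can be sketched as follows; the notation is hypothetical and not part of the definitions above. Let $E_{\bar{s},a}$ be the TD errors logged for action $a$ in abstract state $\bar{s}$, let $\hat{\sigma}(\cdot)$ denote a normalized standard deviation (the exact normalization is left unspecified here), and let $\Theta^{i}_{1}, \dots, \Theta^{i}_{f}$ be the nodes returned by $\delta(\Theta, i, f)$, with $E_{\Theta^{i}_{j},a}$ the concrete-level TD errors that fall under the $j$-th of them. One plausible reading is
\begin{align*}
u(\bar{s}) &= \max_{a} \hat{\sigma}\bigl(E_{\bar{s},a}\bigr), \\
i^{*} &= \operatorname*{arg\,min}_{i} \; \max_{j \in \{1,\dots,f\}} \max_{a} \hat{\sigma}\bigl(E_{\Theta^{i}_{j},a}\bigr),
\end{align*}
where $u(\bar{s})$ is the instability score that \texttt{UnstableStates} clusters and $i^{*}$ is the variable returned by \texttt{UnstableVar}. Aggregating the $f$ children by their worst case (the inner maxima) is an assumption; an average over the children would fit the description equally well.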

\alg{} repeats the learning, evaluation, and refinement phases sequentially until the RL agent has learned an abstract policy $\bar{\pi}$ and a CAT $\xi$ that together solve the MDP $M$ effectively.

\vspace{-1em}
\begin{algorithm}[t]
\caption{Learning Dynamic Abstractions}
\label{alg:main}
\textbf{Input}: $M, f$ \\
\textbf{Output}: $\bar{M}, \xi, \bar{\pi}$
\begin{algorithmic}[1] %[1] enables line numbers
\STATE initialize $\Theta_{init}$, $\xi$, and $\bar{Q}$ \label{line:initialize}
\FOR{$episode = 1, \ldots, n_{epi}$} \label{line:routine_begin}
\STATE $s \leftarrow \texttt{reset()}$
\FOR {$steps$ in $episode$}
\STATE $\bar{s} \leftarrow \texttt{FindAbstract}(\xi, \Theta_{init}, s)$
\STATE $a \leftarrow \bar{\pi}(\bar{s})$
\STATE $s', \bar{r}, done  \leftarrow \texttt{step}(\texttt{extend}(a))$ \label{line:extend}
\STATE $\bar{s}' \leftarrow \texttt{FindAbstract}(\xi, \Theta_{init}, s')$
\STATE $\bar{\pi} \leftarrow \texttt{train}^{\bar{\pi}}(\bar{s}, \bar{s}', a, \bar{r})$ \label{line:train}
\STATE $s, \bar{s} \leftarrow s', \bar{s}'$ \label{line:routine-end}
\ENDFOR
\IF {$\bar{M}$ needs refinement} \label{line:refinement-condition}
\STATE $\Gamma \leftarrow \texttt{evaluate} (M, \xi, \bar{\pi}, n_{\emph{eval}})$ \label{line:sim}
\STATE $unstable \leftarrow \texttt{UnstableStates}(\Gamma)$ \label{line:unstable}
\FOR {each $\Theta$ in $unstable$} \label{line:update-tree-loop}
\STATE $i \leftarrow \texttt{UnstableVar}(\Gamma, \Theta)$ \label{line:var}
\STATE $nodes \leftarrow \texttt{refine}(\Theta, i, f)$ \label{line:refine}
\STATE $\xi \leftarrow \texttt{UpdateTree}(\xi, \Theta, nodes)$ \label{line:end-update-tree}
\ENDFOR
\ENDIF
\ENDFOR
\STATE \textbf{return} $\bar{M}, \xi, \bar{\pi}$
\end{algorithmic}
\end{algorithm}

\subsection{\alg{} Algorithm}
Alg. \ref{alg:main} illustrates the procedure by which the agent learns an MDP's solution and abstractions simultaneously through the learning, evaluation, and refinement phases explained in Sec. \ref{sec:learning-dynamic-abs}. First, the initial coarse abstraction is constructed automatically by initializing $\Theta_{init}$ based on the known ranges of each state variable $v_i$. Then, a CAT $\xi$ is constructed for $\Theta_{init}$ with only the root node (Line \ref{line:initialize} in Alg. \ref{alg:main}).

The initial $\xi$ induces an abstract MDP $\bar{M}$ for the given MDP $M$. Then, the learning phase of \alg{} starts by employing the Q-learning routine (Lines \ref{line:routine_begin} to \ref{line:routine-end} in Alg. \ref{alg:main}). In this phase, \alg{} runs vanilla Q-learning over abstract states and updates Q-values based on samples of the form $\langle \bar{s}, a, \bar{s}', \bar{r} \rangle$. The reward $\bar{r}$ is computed according to the formulation presented in Definition \ref{def:back}, and $\bar{s}$ and $\bar{s}'$ are returned by the function illustrated in Alg. \ref{alg:hat-search}. Once the samples are transformed into this form, \alg{} updates the abstract Q-table in Line \ref{line:train} of Alg. \ref{alg:main}.

Induced by the computed state abstraction, extended actions (taking a concrete action repeatedly until the agent reaches a new abstract state, is blocked, or reaches a terminal concrete state) are applied to the environment instead of the concrete actions (Line \ref{line:extend} in Alg. \ref{alg:main}). \alg{} checks the refinement condition (Line \ref{line:refinement-condition} in Alg. \ref{alg:main}) at the end of each learning episode to decide whether to initiate an abstraction evaluation phase.
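
A minimal sketch of how an extended action might be executed is given below. It is not taken from the implementation: $\texttt{EnvStep}$ is a hypothetical name for a single concrete environment step, blockage is assumed to show up as an unchanged concrete state, and the plain sum of rewards only stands in for the formulation of $\bar{r}$ in Definition \ref{def:back}.

\begin{algorithmic}
\STATE $\bar{s}_{0} \leftarrow \texttt{FindAbstract}(\xi, \Theta_{init}, s)$, \quad $\bar{r} \leftarrow 0$
\REPEAT
\STATE $s', r, done \leftarrow \texttt{EnvStep}(a)$
\STATE $\bar{r} \leftarrow \bar{r} + r$
\STATE $blocked \leftarrow (s' = s)$
\STATE $s \leftarrow s'$
\UNTIL{$done$ \textbf{or} $blocked$ \textbf{or} $\texttt{FindAbstract}(\xi, \Theta_{init}, s) \neq \bar{s}_{0}$}
\STATE \textbf{return} $s, \bar{r}, done$
\end{algorithmic}

Under this reading, the abstract agent observes one transition per extended action, however many concrete steps that action took.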

We set \alg{} to check the recent success rate of the RL agent every $n_{\emph{check}}$ episodes, where the refinement condition evaluates to true if the success rate is below some threshold $t_{\emph{succ}}$. The choice of the refinement condition introduces a trade-off. On one hand, we want to obtain a near-optimal abstraction that enables the agent to learn the solution effectively. On the other hand, the abstract policy $\bar{\pi}$ should be trained enough to be used in the abstraction evaluation phase for refinement purposes. When the refinement condition is true, the algorithm runs the evaluation function $\beta$ for $n_{\emph{eval}}$ episodes (Line \ref{line:sim} in Alg. \ref{alg:main}). Subsequently, the refinement phase (Lines \ref{line:unstable} to \ref{line:end-update-tree} in Alg. \ref{alg:main}) starts by finding the top $k$ unstable states (Line \ref{line:unstable} in Alg. \ref{alg:main}). Next, \alg{} finds the contributing state variable for each unstable state (Line \ref{line:var} in Alg. \ref{alg:main}) and refines the state w.r.t. that variable (Line \ref{line:refine} in Alg. \ref{alg:main}). After refining each unstable state, \alg{} updates the CAT $\xi$ by adding the new abstract states to the abstraction tree (Line \ref{line:end-update-tree} in Alg. \ref{alg:main}).
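
The refinement condition can be written compactly under one assumption that the text leaves open, namely the window over which the recent success rate is measured. Writing $\rho_{e}$ (a symbol introduced here) for the fraction of successful episodes among the $n_{\emph{check}}$ episodes ending at episode $e$, and using the hypothetical predicate name \texttt{NeedsRefinement}, the check would be
\begin{equation*}
\texttt{NeedsRefinement}(e) \iff \bigl(e \bmod n_{\emph{check}} = 0\bigr) \wedge \bigl(\rho_{e} < t_{\emph{succ}}\bigr).
\end{equation*}
For example, with $n_{\emph{check}} = 100$ and $t_{\emph{succ}} = 0.8$, refinement would be triggered at episode 300 if fewer than 80 of episodes 201 to 300 were successful.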


\section{Empirical Evaluation}
\label{sec:emp}
To assess the performance of \alg{}, we implemented the method in Python\footnote[1]{https://github.com/AAIR-lab/CAT-RL.git} and evaluated it in five domains. We executed all deep learning experiments for our baselines on two GeForce RTX 3070 GPUs with 8 GB memory running Ubuntu 18.04 and all of our other experiments on 5.0 GHz Intel i9 CPUs with 64 GB RAM running Ubuntu 18.04.
We investigated the following questions:

\begin{itemize}
    \item[(1)] Does \alg{} improve the sample efficiency of vanilla Q-learning beyond state-of-the-art baselines without any expert knowledge?
    \item[(2)] Does \alg{} increase the scalability of its underlying RL algorithm beyond existing methods?
    \item[(3)] Does \alg{} learn symmetric structures of tasks?
\end{itemize}

\paragraph{Selection of test problems}
For the selection of test problems, we conducted an extensive literature review to ensure that the chosen problems are drawn from contemporary research and are challenging for state-of-the-art methods. As a result, we conducted empirical analyses on three domains with \emph{discrete states}: Office World adapted from \cite{icarte2018using}, Wumpus World derived from \cite{stuart2010artificial}, and Taxi World introduced by \cite{dietterich2000hierarchical} and adapted from the OpenAI Gym environment Taxi-v3\footnote[2]{https://www.gymlibrary.ml/environments/toy\_text/taxi/}, and on two domains with \emph{continuous states}: Water World based on \cite{karpathyreinforcejs,icarte2018using} and Mountain Car from \cite{1606.01540}. We adopted significantly larger instances of these domains (except for Mountain Car, which has a fixed problem size) than the ones used in previous work on non-image-based RL. In addition, all of these domains are \emph{stochastic} problems with varying dimensionality (from 2 to 14). Aside from the main empirical evaluations reported in Sec. \ref{sec:results}, we conducted additional scalability studies (see the supplementary document) on the Office World domain, as a case study, to ensure that the selected test problems challenge the scalability of the state-of-the-art baselines and \alg{}. The details regarding the domains and task descriptions are included in the supplementary document.

\paragraph{Selection of baselines}
For the comparative study, we selected the following baselines: (1) Option-critic \cite{bacon2017option}, (2) JIRP \cite{xu2020joint}, (3) tabular Q-learning \cite{watkins1992q}, (4) DQN \cite{mnih2013playing}, (5) A2C \cite{mnih2016asynchronous}, and (6) PPO \cite{schulman2017proximal}. Option-critic is a Hierarchical RL (HRL) approach that discovers options autonomously while learning option policies simultaneously. JIRP, a symbolic state-of-the-art RL method, automatically infers reward machines and policies for RL. We chose these state-of-the-art methods as baselines because they automatically learn different abstract representations, such as options and reward machines, without requiring any human-engineered inputs. We also chose the state-of-the-art deep RL methods DQN, A2C, and PPO as baselines, since the multiple layers in their neural network architectures progressively construct state abstractions. We use their implementations from the Stable-Baselines3\footnote[3]{https://github.com/DLR-RM/stable-baselines3} framework by \cite{raffin2019stable}.

\paragraph{Hyperparameters}
Throughout the empirical evaluations, we ran \alg{} with $t_{\emph{succ}} = 0.8$, $n_{\emph{eval}} = 100$, $n_{\emph{check}} = 100$, and varying values of $k$ for different domains. One important advantage of \alg{} over deep RL baselines is that \alg{} has only four parameters, as mentioned earlier, and performs robustly regardless of their values as long as they are not set to drastically large or small values within their ranges. In contrast, we performed extensive hyperparameter tuning for the baselines. The details of the neural network architectures, parameters, and hyperparameters used for the baselines and \alg{} are included in the supplementary document.

We report the mean success rates averaged over the last 100 training episodes along with the standard deviations computed from 10 independent runs for each method and domain. We also report the normalized cumulative reward obtained by pausing training every 10 episodes and evaluating the agent on 10 simulation runs. We discuss our results and analysis in detail below.

\begin{figure*}[t]
     \centering
         \includegraphics[width=\textwidth]{images/plot_main.pdf}
        \caption{\small (Top) Success rates (mean and standard deviation) for 10 independent runs averaged over the last 100 training episodes for all the methods, and (Bottom) normalized cumulative reward for 10 simulation runs obtained every 10 training episodes for \alg{} (ours) and the second-best performing baseline for Office World, Wumpus World, Taxi World, Water World, and Mountain Car. Here, discrete/continuous refers to the state space of the domain.
        }
        \label{fig:result}
\end{figure*}

\subsection{Results}
\label{sec:results}
Fig. \ref{fig:result} (top) compares the success rates achieved by all the methods on all the domains. In Office World, \alg{} outperforms all the baselines and nearly converges to a success rate of 1 in around 2k episodes, whereas PPO and DQN achieve approximate success rates of 0.8 and 0.65, respectively, in around 2.5k episodes and have a high standard deviation. In Wumpus World, \alg{} converges to a success rate of 1 within 4k episodes and significantly outperforms all the baselines, which struggle to learn due to the complexity introduced by pitfalls, obstacles, and the size of the environment. In Taxi World, \alg{} achieves the best performance within 12k episodes of training, while Q-learning performs better than all other baselines, reaching a success rate of 0.75 in 20k episodes. In the Water World domain, \alg{} learns slightly faster than PPO while all other baselines perform poorly. In the Mountain Car domain, \alg{} learns significantly faster than DQN, which is the best baseline. We performed further evaluations of \alg{} and the second-best performing baseline on each domain, as shown in Fig. \ref{fig:result} (bottom), by evaluating the policies learned by the agent and comparing the normalized cumulative reward achieved.


\begin{table}[h]
    \centering
    \begin{tabular}{|p{22mm}|p{10mm}|p{10mm}|p{10mm}|}
    \hline 
        \vspace{0.05mm} 
        Domain  & 
        \vspace{0.05mm} 
        $|\mathcal{S}|$ & \vspace{0.05mm} 
        $|\bar{\mathcal{S}}|$ & \vspace{0.05mm} $|\mathcal{S}|/|\bar{\mathcal{S}}|$ \\
        \hline
        Water World & $\infty$ & 49144 & $\infty$\\
        Mountain Car & $\infty$ & 13 & $\infty$\\
        Taxi World & 18000 & 1552 & 11.59\\
        Office World & 5184 & 124& 41.80\\
        Wumpus World & 4096 & 157 & 26.08 \\
        \hline
    \end{tabular}
    \caption{Sizes of concrete state spaces and abstract state spaces for the test problems.}
    \label{tab:state-sizes}
\end{table}

Table \ref{tab:state-sizes} compares the sizes of the concrete state space and the abstract state space under \alg{}. As a result of the significant reduction in the size of the state space, quantified by the abstraction factor in Table \ref{tab:state-sizes}, \alg{} outperforms GPU-based DRL approaches in terms of sample efficiency without relying on expensive computational hardware and without the corresponding hyperparameter tuning. Additional run-time analyses of \alg{} and the best-performing baselines are presented in the supplementary document.
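
As a quick check of the abstraction factor, the last column of Table \ref{tab:state-sizes} is the ratio of the two preceding columns. For the Taxi World and Office World rows,
\begin{equation*}
\frac{18000}{1552} \approx 11.598, \qquad \frac{5184}{124} \approx 41.806,
\end{equation*}
which match the reported 11.59 and 41.80 if the table entries are truncated, rather than rounded, to two decimals (an inference from the numbers, not something stated in the text).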


\subsection{Analysis}  
We now present our analysis of the three key questions mentioned in Sec. \ref{sec:emp}.

\textbf{1. Sample efficiency in the absence of input expert knowledge}
The results presented in Sec. \ref{sec:results} demonstrate that \alg{}'s sample efficiency is superior to that of all baselines in both discrete and continuous domains. We attribute this to the conditional abstractions learned by \alg{} and made available to the vanilla Q-learning algorithm. This effect can be viewed from two perspectives: 1) the meaningful conditional abstractions that are automatically constructed by \alg{} spotlight the most informative aspects of the state space, leading to more sample-efficient learning; and 2) the Q-learning agent benefits from higher levels of exploration over state and action spaces due to the nature of abstraction. This intense exploration can cause more penalization of the agent at the early stages of learning (see the cumulative rewards of \alg{} in Taxi World and Office World in Fig. \ref{fig:result}) but eventually leads to faster learning and superior performance, as reflected in the success rate.


\textbf{2. Scalability to larger tasks}
RL algorithms that learn a policy $\pi$ from a concrete MDP $M$ suffer from the curse of dimensionality as the size of the state space increases. This explains why most of the baselines fail to learn Wumpus World, a basic domain, once the size of the problem increases drastically, as shown in Fig. \ref{fig:result}. In contrast, the top-down abstraction refinement scheme of \alg{} scales effectively to problems with relatively large state spaces. As a result, the abstract representations learned by \alg{} enabled the vanilla Q-learning algorithm to learn those problems relatively quickly and efficiently. We conducted further experiments on the scalability and computational complexity of \alg{} and the baselines; the results are presented in the supplementary document.

\begin{figure}[h]
     \centering
         \includegraphics[width=1.04\columnwidth]{images/review_taxi.pdf}
        \caption{\small Illustration of two different components of a single CAT learned automatically by \alg{} for a Taxi World problem. Abstraction on ``taxi-loc-x'' and ``taxi-loc-y'' changes depending on the value of the passenger-location variable.
        }
        \label{fig:review-taxi}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[scale=0.8]{images/heatmaps}
\caption{\small Drawing out similarities across the state space of an 8$\times$8 taxi world via \alg's automatic abstraction. (Left) R, G, Y, and B are the four predefined pickup/drop-off locations. (Middle) Location D is the destination and the passenger is at the top-left location. (Right) Location D is the destination and the passenger is in the taxi.}
\label{fig:heat}
\end{figure}


\textbf{3 (a). Abstractions identify sub-tasks within a problem} Fig. \ref{fig:review-taxi} illustrates the CAT-based abstraction that was computed automatically by \alg{} for a Taxi domain problem (we use a small problem instance for clarity of illustration). When a passenger needs to be picked up, the CAT abstraction preserves precision around the passenger's location. When the passenger is in the taxi and needs to be dropped off, cells around the passenger's previous location are no longer significant for distinguishing TD errors, so they are merged together, and precision increases around the destination location. This is learned and expressed by \alg{} without human intervention. It is important to realize that these two abstractions are expressed within one learned CAT: they are different subtrees. One subtree is ``active'' when the passenger's location is P, and the other is ``active'' when the passenger's location is the taxi. In an interval abstraction, both parts of the Taxi World (bottom left as well as bottom right) would end up getting refined as episodes continue, thus losing aggregation opportunities and increasing sample complexity. In contrast, CAT abstractions are dynamic (the representation changes depending on the current state) and heterogeneous (the same variable has different ``splits'' based on the values of other variables). This allows our approach to aggregate experience where possible while dynamically increasing resolution on critical-choice paths.


\textbf{3 (b). Abstractions identify symmetry across sub-tasks}
One important property of \alg{} is that it constructs identical abstractions across the state space for similar sub-problems. This capability can be useful in large problems where options can be generalized across identically constructed abstractions. Fig. \ref{fig:heat} shows two conditional abstractions constructed by \alg{} for an 8$\times$8 taxi world. In Fig. \ref{fig:heat} (middle), the passenger is located at the top-left and the destination at the bottom-left of the map. In Fig. \ref{fig:heat} (right), the passenger is in the taxi and the destination is located at the top-left. In both cases, the agent should reach the top-left cell of the map, which implies a similarity. \alg{} discovered this similarity automatically, as seen from the identical abstractions (highlighted area) generated for both cases.

\section{Conclusion}
\label{sec:conclusion}
We presented a novel approach for simultaneously learning dynamic abstract representations along with the solution to problems formulated as an MDP. The overall algorithm of \alg{} proceeds by interleaving the refinement of a coarse initial abstraction with the learning and evaluation of policies for the underlying RL agent. We introduced conditional abstraction trees to compute and represent such refined abstractions throughout the \alg{} procedure. Extensive empirical evaluations demonstrated that \alg{} enables the vanilla Q-learning algorithm to learn the solution to large discrete and continuous problems with dynamic representations, outperforming state-of-the-art RL algorithms. This superior performance of vanilla Q-learning compared to algorithms with complex neural-network-based architectures is due to \alg{}'s scalable abstraction construction scheme, which draws out similarities across the state space and yields high sample efficiency in learning. Future work will consider the automatic discovery of generalizable options using the conditional abstract representations constructed by \alg{}.

\begin{acknowledgements}
This work was supported in part by NSF IIS grant  1942856 and ONR grant N00014-23-1-2416.
\end{acknowledgements}

% References
\bibliography{dadvar_701}

\end{document}
