\section{Algorithm}\label{sec:algo}
The proposed algorithm, \algo, adaptively discretizes the state-action space into a non-uniform grid that becomes finer as time progresses. In this section, we first explain the adaptive discretization process.
\begin{defn}[Cells]\label{def:cell}
    A cell is a dyadic cube with vertices from the set $\{2^{-\ell}(v_1, v_2, \ldots, v_d): v_j \in \{0,1,\ldots,2^\ell\}, j=1,2,\ldots,d\}$ with sides of length $2^{-\ell}$, where $\ell\in\bN$. The quantity $\ell$ is called the level of the cell. We also denote the collection of cells of level $\ell$ by $\cP^{(\ell)}$.~For a cell $\zeta\subseteq \cS \times \cA$, its $\cS$-projection is called an $\cS$-cell,
    \begin{align}
        \pi_\cS(\zeta) &:= \left\{s \in \cS \mid (s,a) \in \zeta \mbox{ for some } a \in \cA \right\},
    \end{align}
    and its level is the same as that of $\zeta$.~Denote the set of $\cS$-cells of level $\ell$ by $\cQ^{(\ell)}$.~For a cell\slash $\cS$-cell $\zeta$, we let $\ell(\zeta)$ denote its level, and let $q(\zeta)$ denote a point from $\zeta$ that is its unique representative point. $q\inv$ maps a representative point to the cell\slash $\cS$-cell that the point is representing, i.e., $q\inv(z) = \zeta$ such that $q(\zeta) = z$.\footnote{With a slight abuse of notation, we use the maps $\ell(\cdot)$, $q(\cdot)$ and $q\inv(\cdot)$ for both cells and $\cS$-cells. Note that for cells and $\cS$-cells, these maps have different domains and codomains.}
\end{defn}
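As a concrete illustration of Definition~\ref{def:cell}, consider the following sketch, which assumes (purely for illustration) that $\cS \times \cA = [0,1]^2$, so that $d = 2$ and $d_\cS = 1$. At level $\ell = 1$ the vertices lie in $\{0, \tfrac{1}{2}, 1\}^2$, and
\begin{align*}
    \cP^{(1)} &= \left\{[0,\tfrac{1}{2}]\times[0,\tfrac{1}{2}],~[0,\tfrac{1}{2}]\times[\tfrac{1}{2},1],~[\tfrac{1}{2},1]\times[0,\tfrac{1}{2}],~[\tfrac{1}{2},1]\times[\tfrac{1}{2},1]\right\},\\
    \cQ^{(1)} &= \left\{[0,\tfrac{1}{2}],~[\tfrac{1}{2},1]\right\}.
\end{align*}
More generally, if $\cS \times \cA = [0,1]^d$ and $\cS = [0,1]^{d_\cS}$ (again an assumption made only for this example), then $\abs{\cP^{(\ell)}} = 2^{\ell d}$ and $\abs{\cQ^{(\ell)}} = 2^{\ell d_\cS}$, and distinct cells of the same level overlap only on their boundaries.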

\begin{defn}[Partition tree]\label{def:part_tree}
    A partition tree of depth $\ell$ is a tree in which
    (i) each node at depth $m \leq \ell$ is a cell of level $m$, and (ii) if $\zeta$ is a cell of level $m < \ell$, then a) the cells of level $m+1$ that together partition $\zeta$ are the child nodes of $\zeta$; these cells are called child cells, and $\textit{Child}(\zeta)$ denotes the set of all child cells of $\zeta$; b) $\zeta$ is called the parent cell of these child cells.~The set of all ancestor nodes of a cell $\zeta$ is called the ancestors of $\zeta$.
\end{defn}
\algo~(Algorithm~\ref{algo:zorl}) maintains a set of ``active cells.''~The following rule is used to activate and deactivate cells.
\begin{defn}[Activation rule]\label{def:activationrule}
    For a cell $\zeta$, define
    \begin{align}
        N_{\max}(\zeta) &:= \frac{c_a 2^{d_\cS+2} \log{\br{\frac{T}{\delta}}}}{\diamc{\zeta}^{d_\cS+2}}, \label{Nmax} \mbox{ and},\\
        N_{\min}(\zeta) &:= \begin{cases}
            ~~1 &\mbox{ if } \zeta = \cS \times \cA\\
            \frac{c_a \log{\br{\frac{T}{\delta}}}}{\diamc{\zeta}^{d_\cS+2}}, &\mbox{otherwise,}\label{Nmin}
        \end{cases}
    \end{align}
    where $c_a>1$ is a constant that satisfies \eqref{def:ca}, and $\delta \in (0,1)$ is the confidence parameter.~Let $N_t(\zeta)$ denote the number of visits to $\zeta$ until time $t$.~Active cells and the visit counter are defined as follows.
    \begin{enumerate}
        \item Any cell $\zeta$ is said to be active if $N_{\min}(\zeta) \leq N_t(\zeta) < N_{\max}(\zeta)$.
        \item $N_t(\zeta)$ is defined for all cells as the number of times $\zeta$ or any of its ancestors has been visited while being active until time $t$, i.e.,
    \begin{align}\label{def:visitcounter}
        N_t(\zeta) &:= \sum_{i=0}^{t-1}{\ind{(s_i, a_i) \in \zeta_i}},
    \end{align}
    where $\zeta_i$ is the unique cell that is active at time $i$ and satisfies $\zeta \subseteq \zeta_i$.
    \end{enumerate}
Denote the set of active cells at time $t$ by $\cP_t$.
\end{defn}
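The two thresholds in Definition~\ref{def:activationrule} are matched across consecutive levels. The following short check uses only \eqref{Nmax}, \eqref{Nmin} and the fact that, for dyadic cells, a child cell $\zeta' \in \textit{Child}(\zeta)$ satisfies $\diamc{\zeta'} = \diamc{\zeta}/2$:
\begin{align*}
    N_{\min}(\zeta') = \frac{c_a \log{\br{\frac{T}{\delta}}}}{\br{\diamc{\zeta}/2}^{d_\cS+2}} = \frac{c_a 2^{d_\cS+2} \log{\br{\frac{T}{\delta}}}}{\diamc{\zeta}^{d_\cS+2}} = N_{\max}(\zeta).
\end{align*}
Hence the visit count at which $\zeta$ ceases to be active is exactly the count at which each of its child cells becomes active.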
We note that since the diameter of a child cell is half that of its parent, a parent cell is deactivated and its child cells are activated simultaneously.~Since a cell is partitioned by its child cells, the set of active cells at time $t$, $\cP_t$, forms a partition of the state-action space. \algo~clusters the state-action pairs into the active cells using the information gathered until time $t$. All points in an active cell (cluster) $\zeta$ are treated as similar for the purpose of generating optimal actions, and the cell is hence represented by its unique representative point $q(\zeta)$.~Denote the collection of representative points of the active cells at time $t$ by $\cZ_t:= \flbr{q(\zeta): \zeta \in \cP_t}$.~Let $\ell_{\max,t}$ be the level of the smallest cells in $\cP_t$.~At time $t$,~\algo~partitions the state space into $\cS$-cells of level $\ell_{\max,t}$. We denote this $\cS$-cell partition by $\cQ_t$, i.e., $\cQ_t := \cQ\uc{\ell_{\max,t}}$, and the corresponding set of representative points by $\cS_t$, i.e., $\cS_t := \flbr{q(\zeta) : \zeta \in \cQ_t}$. $\cS_t$ can be thought of as the discretized state space at time $t$. \algo~maintains estimates of the transition probability kernel that are supported on $\cS_t$.

Now, we introduce a generic notation for discretized transition kernels, which is used often in this paper.~Let $\tilde{\cS}$ be the set of representative points of a partition of $\cS$ that consists only of $\cS$-cells. Then, for a continuous transition kernel $\tilde{p}$ and a set $\tilde{\cZ} \subseteq \cS \times \cA$, we define $\wp_{\tilde{\cZ} \to \tilde{\cS},\tilde{p}}: \tilde{\cZ} \to [0,1]^{\tilde{\cS}}$ as follows,
\begin{align}\label{def:disc_p}
    \wp_{\tilde{\cZ} \to \tilde{\cS},\tilde{p}}(z,s) := \tilde{p}(z,q\inv(s)),~\forall z \in \tilde{\cZ}, s \in \tilde{\cS}.
\end{align}
The kernel $\wp_{\tilde{\cZ} \to \tilde{\cS},\tilde{p}}$ can be viewed as a discretization of $\tilde{p}$.

\textbf{Estimating the Transition Kernel.}
~Let $N_t(\zeta, \xi)$ be the total number of transitions from a cell $\zeta$, or from its active ancestors, to an $\cS$-cell $\xi$ until time $t$, i.e., $N_t(\zeta, \xi) := \sum_{i=0}^{t-1}{\ind{(s_i, a_i, s_{i+1}) \in \zeta_i \times \xi}}$, where $\zeta_i$ is as in~\eqref{def:visitcounter}.~For any state-action pair $z$, we let $q\inv_t(z)$ denote the active cell that contains $z$.~Denote $\tilde{\cS}_t(z) := \{q(\xi) : \xi \in \cQ^{(\ell(q\inv_t(z)))}\}$, which is the set of representative states of the $\cS$-cells of level $\ell(q\inv_t(z))$.~We first construct an estimate $\hat{p}^{(d)}_t$~\eqref{eq:kernel_estimate} of the discretized version of the true stochastic kernel as follows,
\begin{align}\label{eq:kernel_estimate}
    \hat{p}^{(d)}_t(z,s) := \frac{N_t\br{q\inv(z), q\inv(s)}}{1 \vee N_t\br{q\inv(z)}},
\end{align}
for $z \in \cZ_t$ and $s \in \tilde{\cS}_t(z)$.~Note that the distribution $\hat{p}^{(d)}_t(z,\cdot)$ is supported on the finite set $\tilde{\cS}_t(z)$, and the sets $\{\tilde{\cS}_t(z)\}$ are adaptive.~$\hat{p}^{(d)}_t(z,\cdot)$ is then extended to obtain a continuous kernel $\hat{p}_t$, defined as
\begin{align}\label{def:p_hat}
    \hat{p}_t(z,B) := &\sum_{s \in \tilde{\cS}_t(z)}{\frac{\lambda(B \cap q\inv(s))}{\lambda(q\inv(s))} \hat{p}^{(d)}_t(z,s)},
\end{align}
where $z \in \cZ_t$, $B \in \cB_\cS$, and $\lambda(\cdot)$ is the Lebesgue measure on $(\cS,\cB_\cS)$.~To obtain a computationally feasible algorithm, we work with the discretization $\wp_{\cZ_t \to \cS_t,\hat{p}_t}$ of $\hat{p}_t$.

Note that the set $\tilde{\cS}_t(z)$ depends upon the level of the active cell containing $z$, so that the support of the discrete kernel $\hat{p}^{(d)}_t(z,\cdot)$ varies with $z$.~The construction of $\wp_{\cZ_t \to \cS_t,\hat{p}_t}$ from $\hat{p}^{(d)}_t$ ensures that the support of the discrete kernel is the same set, $\cS_t$, at every point. This allows us to use the \evi~algorithm, which is introduced later in this section.
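
To make the construction concrete, the following sketch works out $\wp_{\cZ_t \to \cS_t,\hat{p}_t}$ explicitly. It assumes, for illustration only, that $\cS = [0,1]^{d_\cS}$, so that an $\cS$-cell of level $m$ has Lebesgue measure $2^{-m d_\cS}$, and it ignores cell boundaries, which have measure zero. Fix $z \in \cZ_t$ and let $\ell := \ell(q\inv_t(z)) \leq \ell_{\max,t}$. For $s' \in \cS_t$, let $s \in \tilde{\cS}_t(z)$ be the representative point of the level-$\ell$ $\cS$-cell that contains $q\inv(s')$. Combining \eqref{def:disc_p} and \eqref{def:p_hat}, only the term corresponding to $s$ survives in the sum, and
\begin{align*}
    \wp_{\cZ_t \to \cS_t,\hat{p}_t}(z,s') = \hat{p}_t(z,q\inv(s')) = \frac{\lambda(q\inv(s'))}{\lambda(q\inv(s))}\hat{p}^{(d)}_t(z,s) = 2^{-d_\cS (\ell_{\max,t} - \ell)}~\hat{p}^{(d)}_t(z,s).
\end{align*}
In words, under these assumptions the mass that $\hat{p}^{(d)}_t(z,\cdot)$ assigns to a coarse $\cS$-cell is split equally among the $2^{d_\cS (\ell_{\max,t} - \ell)}$ finest-level $\cS$-cells that it contains.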

\textbf{Concentration Inequality.}~\algo~constructs a confidence ball centered at $\wp_{\cZ_t \to \cS_t,\hat{p}_t}$ that contains the discretized version of the true transition kernel $p$ with high probability (w.h.p.). For a cell $\zeta \in \cP_t$, the confidence radius associated with the estimate $\wp_{\cZ_t \to \cS_t,\hat{p}_t}(q(\zeta),\cdot)$ is defined as follows,
\begin{align}\label{def:eta_k}
    \eta_t(\zeta) &:= \min \Bigg\{2,  3 \br{\frac{c_a \log{\br{\frac{T}{\delta}}}}{N_t(\zeta)}}^\frac{1}{d_\cS + 2} \notag\\
    &\qquad\qquad + (3 L_p + C_p) \diamc{\zeta}\Bigg\},
\end{align}
where $C_p$ is an upper bound on the derivatives of the transition density functions, as described in Assumption~\ref{assum:bdd_der}, and the constant $c_a \geq 1$ satisfies~\eqref{def:ca}.~It turns out that the following value of $c_a$ satisfies~\eqref{def:ca}:
\begin{align}
    c_a = \frac{2 d^{\frac{d_\cS}{2}}}{9} \frac{\log{\br{6 d^\frac{d}{2}}}}{\log{\br{\frac{T}{\delta}}}} + \frac{d}{d_\cS+2} + 1. \label{value:ca}
\end{align}
Lemma~\ref{lem:conc_ineq} shows that w.h.p.,
\begin{align*}
    \norm{\wp_{\cS\times\cA \to \cS_t,p}(z,\cdot) - \wp_{\cZ_t \to \cS_t,\hat{p}_t}(q(\zeta),\cdot)}_{TV}& \leq \eta_t(\zeta),\\
    &\forall z \in \zeta,
\end{align*}
for every $t$ and every $\zeta \in \cP_t$.~This leads to the definition of the confidence ball that \algo~uses.
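As a quick check of how the radius~\eqref{def:eta_k} scales, consider an active cell $\zeta \neq \cS \times \cA$. By Definition~\ref{def:activationrule}, $N_t(\zeta) \geq N_{\min}(\zeta) = c_a \log{\br{\frac{T}{\delta}}} / \diamc{\zeta}^{d_\cS+2}$, so that
\begin{align*}
    \br{\frac{c_a \log{\br{\frac{T}{\delta}}}}{N_t(\zeta)}}^\frac{1}{d_\cS + 2} \leq \diamc{\zeta},
    \qquad \mbox{and hence} \qquad
    \eta_t(\zeta) \leq \min\left\{2,~(3 + 3 L_p + C_p)\diamc{\zeta}\right\}.
\end{align*}
Thus, for every active cell other than the root, the confidence radius is at most a constant multiple of the cell diameter.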

Now, we introduce the discrete state-action space that is used in the definition of the confidence ball.~The set of relevant cells for $s \in \cS$ at time $t$ is defined as $\rel{t}{s} := \{ \zeta \in \cP_t \mid \exists a \in \cA \mbox{ such that } (s,a) \in \zeta \}$.~These are the active cells whose $\cS$-projection contains the state $s$.~Thus, $\rel{t}{s}$ can be seen as the set of cells in the state-action space that are currently associated with state $s$. Recall that $\cS_t$ is the discrete state space at time $t$.~Define
\begin{align*}
    &\cA_t(s) \\
    &:= \cup_{\zeta \in \rel{t}{s}}{\left\{a \in \cA \mid q(\zeta) = (s\up,a) \mbox{ for some } s\up \in \cS \right\}}. %\{ q(\pi_\cA(\zeta)) \mid \zeta \in \rel{t}{s} \}, 
\end{align*}
$\cA_t(s)$ denotes the set of actions that the agent can currently play in state $s$. The discrete action space at time $t$ is given by $\cA_t := \{\cA_t(s) : s \in \cS_t\}$.~Let $\cS_t \times \cA_t := \{(s,a) \mid s \in \cS_t, a \in \cA_t(s)\}$.~Define the confidence ball,
\begin{align}
    &\cC_t := \notag\\
    &\big\{\te: \cS_t \times \cA_t \mapsto [0,1]^{\cS_t} \mid \sum_{s \in \cS_t}{\te(z,s)} = 1,~\forall z \in \cS_t \times \cA_t, \notag\\
    & \norm{\te(z\up,\cdot) - \wp_{\cZ_t \to \cS_t, \hat{p}_t}(\bar{z},\cdot)}_1 \leq \eta_t(q\inv(\bar{z})) \mbox{ for every } \notag\\
    & \bar{z} \in \cZ_t, z\up \in q\inv(\bar{z}) \cap \cS_t \times \cA_t\big\}.\label{def:confball}
\end{align}

As a consequence of Lemma F.1, $\cC_t$ contains $\wp_{\cS_t \times \cA_t \to \cS_t, p}$ w.h.p. Denote the time at which the $k$-th episode of \algo~begins by $\tau_k$.~At the beginning of each episode $k$, \algo~constructs a set of discrete MDPs $\cM^{+}_{\tau_k}$ whose transition kernel may be any element of $\cC_{\tau_k}$, and whose reward function equals the true reward at the discrete points $\cS_{\tau_k}\times\cA_{\tau_k}$ plus a bonus term. Such a set of MDPs is called the ``extended MDP,'' and it is commonly used to incorporate optimism in upper confidence bound-based RL algorithms~\citep{jaksch2010near}. The optimal average reward of the extended MDP exceeds the optimal average reward of the true MDP since $\cC_t$ contains the true discretized transition kernel $\wp_{\cS_t \times \cA_t \to \cS_t, p}$ w.h.p.; this yields an ``optimistic push'' which ensures ``sufficient exploration.''~The confidence ball shrinks as the number of visits to the different state-action pairs grows, which reduces the optimism bonus. The extended MDP thus approximates the true MDP increasingly closely in the ``important regions'' of the state-action space (those necessary for recovering an optimal policy) as time progresses.~Next, we discuss the extended MDP in detail, how to solve it, and its role in \algo.

\textbf{Extended MDP.}~Consider the following modified reward function defined on $\cS_t \times \cA_t$,
\begin{align*}
    \tilde{r}_t(s,a) &= r(q(q\inv_t(s,a))) + L_r \diamc{q\inv_t(s,a)}, %\label{def:bonus_reward}
\end{align*}
in which a bonus term proportional to the diameter of the active cell that contains $(s,a)$ has been included in order to compensate for the ``discretization error.''~Consider the following collection of MDPs, $\cM^{+}_t := \{(\cS_t, \cA_t, \tilde{p}, \tilde{r}_t) : \tilde{p} \in \cC_t\}$.~One may view $\cM^{+}_t$ as a single MDP with the finite state space $\cS_t$ and an extended action space, hence the name extended MDP. An element of the extended action space has two components: a control input from $\cA_t$, and a transition kernel from $\cC_t$. Let $\Phi_t$ be the set of policies $\phi$ that satisfy $\phi(s) \in \cA_t(s),~\forall s \in \cS_t$.~Denote the optimal average reward of $\cM^+_t$ by $J\ust_{\cM^+_t}$.~\algo~uses the \evi~algorithm to obtain an optimal policy for the extended MDP at the beginning of every episode.~This is discussed next.
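
Before turning to \evi, the role of the bonus term can be checked directly. The following sketch assumes that $r$ is $L_r$-Lipschitz with respect to the metric that defines $\diamc{\cdot}$; this is an assumption stated here only for the illustration. Let $(s,a) \in \cS_t \times \cA_t$ and $\zeta = q\inv_t(s,a)$. For every $z \in \zeta$, the distance between $z$ and $q(\zeta)$ is at most $\diamc{\zeta}$, so
\begin{align*}
    r(z) \leq r(q(\zeta)) + L_r \diamc{\zeta} = \tilde{r}_t(s,a).
\end{align*}
That is, under this assumption, $\tilde{r}_t(s,a)$ upper-bounds the true reward at every point of the active cell that contains $(s,a)$, and the gap is at most $2 L_r \diamc{\zeta}$.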

\begin{algorithm}[ht]
    \caption{Extended Value Iteration~(\evi)}
    \label{algo:evi}
    \begin{algorithmic}
        \STATE {\bfseries Input} Extended MDP $\cM^+$, accuracy parameter $\gamma > 0$.
        \STATE {\bfseries Initialize} $v_0(s) = 0$ for all $s \in S$, $n = 0$.
        \WHILE{True}
            \STATE $v_{n+1} = \cT v_n$~\eqref{def:T_v}
            \IF{$\spn{v_{n+1} - v_{n}} \leq \gamma$}
                \STATE {\bfseries break}
            \ENDIF
            \STATE $n \leftarrow n+1$
        \ENDWHILE
        \RETURN Greedy Policy w.r.t. $v_n$ %$G v_n$~\eqref{def:G}
    \end{algorithmic}
\end{algorithm}
\begin{algorithm}[ht]
    \caption{Extended Policy Evaluation~(\epe)}
    \label{algo:epe}
    \begin{algorithmic}
        \STATE {\bfseries Input} Extended MDP $\cM^+$, policy $\phi$, accuracy parameter $\gamma > 0$, reference state $s\lst$.
        \STATE {\bfseries Initialize} $v_0(s) = 0$ for all $s \in S$, $n = 0$.
        \WHILE{True}
            \STATE $v_{n+1}(s) = \underset{\te \in \cC}{\max}\Big\{\tilde{r}(s,\phi(s)) +\underset{s\up \in S}{\sum}{\te(s,\phi(s),s\up) v_n(s\up)}\Big\}$ for all $s \in S$
            \IF{$\spn{v_{n+1} - v_{n}} \leq \br{v_{n+1}(s\lst) - v_{n}(s\lst)}\gamma$}
                \STATE {\bfseries break}
            \ENDIF
            \STATE $n \leftarrow n+1$
        \ENDWHILE
        \RETURN $v_{n+1}(s\lst) - v_{n}(s\lst)$
    \end{algorithmic}
\end{algorithm}

\textbf{\evi} (Algorithm~\ref{algo:evi}) takes as input an extended MDP, and an error tolerance parameter $\gamma > 0$, and returns a policy whose average reward is $\gamma$-close to the optimal value of the extended MDP.~A generic extended MDP $\cM^+ = \{(S, A, \tilde{p}, \tilde{r}) : \tilde{p} \in \cC\}$ has a discrete state space $S$, and discrete action space $A = \{ A(s) : s \in S\}$ where $A(s)$ is the set of actions that are permissible in state $s$.~$\cC$ is a set of transition kernels which yield a distribution over $S$ for each point in $S \times A$.~$\tilde{r}$ is the reward function.~Given the extended MDP $\cM^+$, define the following operator $\cT: \bR^{S}\mapsto\bR^{S}$,
\begin{align}
    \cT v(s) =  \max_{\substack{a \in A(s)\\ \te \in \cC}}\Big\{\tilde{r}(s,a) + \sum_{s\up \in S}{\te(s,a,s\up) v(s\up)}\Big\}.\label{def:T_v}
\end{align}
Observe that $\cT$ is the Bellman operator~\citep{puterman2014markov} for the extended MDP $\cM^+$, where the maximization is over the extended action space $A(s) \times \cC$. Recall that the Bellman operator for a usual MDP maximizes over the set of actions only. At time $\tau_k$, \algo~calls \evi$(\cM^+_{\tau_k}, 1/\sqrt{T})$. The \evi~subroutine then applies the Bellman operator~\eqref{def:T_v} for $\cM^+_{\tau_k}$ repeatedly until the stopping criterion is met, and returns the policy $\tilde{\phi}_k \in \Phi_{\tau_k}$, which is $1/\sqrt{T}$-near optimal (Lemma~\ref{lem:conv_evi}). \algo~then extends $\tilde{\phi}_k$ to the entire continuous state space $\cS$ to obtain $\phi_k$ as follows: for every state in the $\cS$-cell $\xi \in \cQ_{\tau_k}$, $\phi_k$ plays $\tilde{\phi}_k(q(\xi))$, i.e.,
\begin{align}
    \phi_k(s) = \tilde{\phi}_k(q(\xi)),\forall s \in \xi,~\xi \in \cQ_{\tau_k}.  \label{map:phi_varphi}
\end{align}
\textbf{Episode Duration.}~\algo~chooses the duration of the $k$-th episode as a function of the expected diameter of the states visited at stationarity of the chosen policy, $\phi_k$. Define the extended MDP $\cM^{d,+}_{t} = \{(\cS_t, \cA_t, \tilde{p}, d_t) : \tilde{p} \in \cC_t\}$, where
\begin{align*}
    d_t(s,a) := \diamc{q\inv_t(s,a)},~\forall (s,a) \in \cS_t \times \cA_t.
\end{align*}
Let $\tilde{\phi} \in \Phi_t$.~We define the proxy diameter of $\tilde{\phi}$ at time $t$ as the average reward of the policy $\tilde{\phi}$ evaluated on MDP $\cM^{d,+}_{t}$ and denote it by $\pdiam{t}{\tilde{\phi}}$.~To be precise, $\pdiam{t}{\tilde{\phi}}$ is the optimal value of $\cM^{d,+}_{t}$ when the control input component of the extended action is chosen according to the policy $\tilde{\phi}$, and the transition kernel is chosen so as to maximize the average reward.~Define the diameter of a policy $\phi \in \Phi_{SD}$ at time $t$ as follows:
\begin{align}
    \diam{t}{\phi} := \int_{\cS}{\diamc{q\inv_t(s,\phi(s))} \mu\uc{\infty}_{\phi,p}(s) ds}.\label{def:diam_pol}
\end{align}
In Appendix~\ref{app:prop_pdiam}, we show that $\pdiam{\tau_k}{\tilde{\phi}_k}$ is a tight upper bound on $\diam{\tau_k}{\phi_k}$ for every $k$.~The duration of the $k$-th episode, $H_k$, is chosen as
\begin{align}\label{def:epi_dur}
    H_k = \frac{C_H \log{\br{T/\delta}}}{\pdiam{\tau_k}{\tilde{\phi}_k}^{2(d_\cS + 1)}},
\end{align}
where $C_H$, a problem-dependent quantity of order $\cO(\log(T))$, satisfies \eqref{def:CH}. This choice of episode duration ensures that the diameter of the chosen policy reduces in every episode. \algo~uses \epe~(Algorithm~\ref{algo:epe}) to compute $\pdiam{\tau_k}{\tilde{\phi}_k}$. $\epe(\cM^{d,+}_{t}, \tilde{\phi}, \gamma, s\lst)$ returns a value in the interval $[\br{1+\gamma}\inv\pdiam{t}{\tilde{\phi}}, \br{1-\gamma}\inv\pdiam{t}{\tilde{\phi}}]$~(Corollary~\ref{cor:conv_epe}) for any $\tilde{\phi} \in \Phi_t$, where $\gamma$ is a parameter chosen by the agent.
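Combining this guarantee with the rule used in Algorithm~\ref{algo:zorl} gives a quick check of how the approximation error of \epe~affects the episode length. Assume $\gamma \in (0,1)$, let $d_k$ denote the value returned by \epe, and write $\bar{H}_k$ (notation introduced only for this illustration) for the right-hand side of~\eqref{def:epi_dur}. Since $x \mapsto x^{-2(d_\cS+1)}$ is decreasing on $(0,\infty)$,
\begin{align*}
    \br{1-\gamma}^{2(d_\cS+1)}~\bar{H}_k \leq C_H \log{\br{T/\delta}}~d_k^{-2(d_\cS + 1)} \leq \br{1+\gamma}^{2(d_\cS+1)}~\bar{H}_k.
\end{align*}
The episode duration computed from $d_k$ therefore differs from $\bar{H}_k$ by at most a multiplicative factor that depends only on $\gamma$ and $d_\cS$.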

\begin{algorithm}[ht]
    \caption{Zooming Algorithm for RL~(\algo)}
    \label{algo:zorl}
    \begin{algorithmic}
        \STATE {\bfseries Input} Horizon $T$, upper-bounds on $L_r$, $L_p$, $C_p$, constants $c_a$, $C_H$ and accuracy parameter $\gamma > 0$
        \STATE {\bfseries Initialize} $h=0$, $k=0$, $H_0 = 0$, $\cP_0 = \{\cS \times \cA\}$
        \FOR{$t= 0$ to $T-1$}
            \IF{$h \geq H_k$}
                \STATE $k \leftarrow k+1$, $h \leftarrow 0$, $\tau_k = t$, choose $s\lst \in \cS_t$
                \STATE Construct $\cM^{+}_{\tau_k}$ and $\cM^{d,+}_{\tau_k}$
                \STATE $\tilde{\phi}_k = \evi(\cM^+_{\tau_k}, 1/\sqrt{T})$
                \STATE Obtain $\phi_k$ from $\tilde{\phi}_k$ according to \eqref{map:phi_varphi}
                \STATE $d_k = \epe(\cM^{d,+}_{\tau_k}, \tilde{\phi}_k, \gamma, s\lst)$
                \STATE $H_k = C_H \log{\br{T/\delta}}~ d_k^{-2(d_\cS + 1)}$
            \ENDIF
            \STATE $h \leftarrow h+1$
            \STATE Play $a_t = \phi_k(s_t)$, observe $s_{t+1}$ and receive $r(s_t, a_t)$
            \IF{$N_{t+1}(q\inv_t(s_t, a_t)) \geq N_{\max}(q\inv_t(s_t, a_t))$}
                \STATE $\cP_{t+1} = \left(\cP_t \setminus \{q\inv_t(s_t, a_t)\}\right) \cup \textit{Child}(q\inv_t(s_t, a_t))$
            \ELSE
                \STATE $\cP_{t+1} = \cP_t$
            \ENDIF
        \ENDFOR
	\end{algorithmic}
\end{algorithm}