\section{Regret Analysis}\label{sec:regret}
We let $\Delta(\phi) := J\ust_{\cM} - J_{\cM}(\phi)$ denote the suboptimality of policy $\phi$.~The following result relates the suboptimality of a policy to the suboptimality gaps of the state-action pairs that the policy visits, where the suboptimality gap of a state-action pair is defined in \eqref{def:subgap}.~Its proof is deferred to Appendix~\ref{app:gen_res}.
\begin{lemma}\label{lem:gap_phi}
    Consider the MDP $\cM = (\cS, \cA, p,r)$. For any policy $\phi \in\Phi_{SD}$, we have
    \begin{align*}
        \Delta(\phi) = \int_{\cS}{\gap{s,\phi(s)}~ \mu\uc{\infty}_{\phi,p}(s)~ ds}.
    \end{align*}
\end{lemma}
We make the following assumption on the true kernel $p$ in order to derive a concentration bound for the estimate of the discretized transition kernel $\wp_{\cS_t \times \cA_t \to \cS_t,p}$~\eqref{def:disc_p}.
\begin{assum}[Bounded Radon-Nikodym derivative]\label{assum:bdd_der}
    The probability measures $\{p(s,a,\cdot)\}$ are absolutely-continuous w.r.t. the Lebesgue measure on $(\cS,\cB_\cS)$, with density functions given by $\{f_{(s,a)}\}$.~We assume that these densities satisfy 
    \begin{align*}
        \norm{\frac{\partial f_{(s,a)}(s^+)}{\partial s^+(i)}}_\infty \leq C_p, \forall (s,a) \in \cS \times \cA, i = 1, 2, \ldots,  d_\cS,
    \end{align*}
    where the variable $s^+ = (s^+(1),s^+(2),\cdots,s^+(d_\cS))$ represents the next state.
\end{assum}
Assumption~\ref{assum:bdd_der} ensures that the discretizations of $p(s,a,\cdot)$ with respect to the partitions $\cQ^{(\ell(q\inv_t(s,a)))}$ and $\cQ_t$ are at distance at most $C_p~ \diamc{q\inv_t(s,a)}$ from each other (Lemma~\ref{lem:disc_dist}).~Using this result, Lemma~\ref{lem:conc_ineq} shows that under Assumption~\ref{assum:lip} and Assumption~\ref{assum:bdd_der}, the event $\cap_{t=0}^{T-1}\{\wp_{\cS_t \times \cA_t \to \cS_t, p} \in \cC_t\}$ occurs w.h.p.~The following assumption allows us to derive an upper-bound on the span of the EVI iterates, which is essential to ensure that the algorithm is not overly optimistic.
\begin{assum}[Bound on Stationary Distributions]\label{assum:statn_dist}
    There is a constant $\kappa > 0$ such that for every policy $\phi \in \Phi_{SD}$, and for every $\zeta \in \cB_\cS$, we have, $\kappa \cdot \lambda(\zeta) \leq \mu^{(\infty)}_{\phi,p}(\zeta)$, where $\lambda(\cdot)$ denotes the Lebesgue measure on $(\cS,\cB_\cS)$.
\end{assum}
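As an illustration of how Assumption~\ref{assum:statn_dist} interacts with Lemma~\ref{lem:gap_phi}, consider the following sketch.~It relies on the additional assumption, not restated in this section, that $\gap{s,a} \geq 0$ for every state-action pair.~Since $\kappa \cdot \lambda(\zeta) \leq \mu\uc{\infty}_{\phi,p}(\zeta)$ holds for every $\zeta \in \cB_\cS$, the density of the stationary distribution satisfies $\mu\uc{\infty}_{\phi,p}(s) \geq \kappa$ for almost every $s \in \cS$, and hence
\begin{align*}
    \Delta(\phi) = \int_{\cS}{\gap{s,\phi(s)}~ \mu\uc{\infty}_{\phi,p}(s)~ ds} \geq \kappa \int_{\cS}{\gap{s,\phi(s)}~ ds}.
\end{align*}
In words, under these assumptions a policy with small suboptimality must have a small gap on average over the state space, and not only on the states that it visits often.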
\begin{remark}[Regarding Assumptions]
    In the average reward setup for continuous space MDPs, assumptions similar to Assumption~\ref{assum:statn_dist} or more restrictive assumptions are needed.~For example, \citet{ormoneit2002kernel} assumes that the transition kernel of the underlying MDP has a strictly positive Radon-Nikodym derivative in order to show that a proposed adaptive policy converges to an optimal policy.~\citet{wang2023optimal} and \citet{shah2018q} derive optimal sample complexity for average reward RL and for discounted reward RL, respectively, under an assumption that the $m$-step transition kernel is bounded below by a known measure.~\citet{kar2024policy} also make the same assumption as ours in order to derive the regret upper-bound of their adaptive discretization-based algorithm.~\citet{wei2021learning} bounds the regret of an average reward RL algorithm when the relative value function is a linear function of a set of known feature maps.~Their ``uniformly excited features'' assumption ensures that upon playing any policy, the confidence ball shrinks in each direction, which has an effect similar to that of Assumption~\ref{assum:statn_dist}.
\end{remark}
We now present our main result, which provides an upper-bound on the regret of \algo. We only provide a proof sketch here and defer the detailed proof to the appendix.

\begin{thm}\label{thm:regupperbound}
     Under Assumptions~\ref{assum:lip}, \ref{assum:unif_ergodic}, \ref{assum:bdd_der} and \ref{assum:statn_dist}, with probability at least $1 - \delta$, $\cR(T;\algo)$ is upper-bounded as $\cO\big(T^{1-\deff\inv}\big)$ where $\deff = 2 d_\cS + d_z + 3$. 
\end{thm}

\begin{proof}[Proof sketch]
    We decompose the regret~\eqref{def:regret} in the following manner.~Let $K(T)$ denote the total number of episodes during $T$ timesteps.~Then, 
    \begingroup
    \allowdisplaybreaks
    \begin{align*}
       & \cR(T;\algo) = T J\ust_{\cM} - \sum_{k=1}^{K(T)}{\sum_{t = \tau_k}^{\tau_{k+1}-1}{r(s_t,a_t)}} \notag\\
        &= \underbrace{\sum_{k=1}^{K(T)}{H_k \br{J\ust_{\cM} - J_{\cM}(\phi_k)}}}_{(a)} \\
        &+ \underbrace{\sum_{k=1}^{K(T)}{\br{H_k~ J_{\cM}(\phi_k) - \sum_{t=\tau_k}^{\tau_{k+1}-1}{r(s_t,\phi_k(s_t))}}}}_{(b)}.
    \end{align*}
    \endgroup
    Term (a) captures the regret arising from playing a suboptimal policy $\phi_k$ during the $k$-th episode, while term (b) captures the possible degradation in performance during the transient stage as compared with the average rewards of the chosen policies. Terms (a) and (b) are bounded separately below.

    \textbf{Bounding} (a): Step 1: In Lemma~\ref{lem:optimism}, we show that the policy obtained by solving~$\cM^+_t$ is optimistic, i.e., w.h.p. $J^{\star}_{\cM^+_t} \geq J\ust_{\cM}$.~Also, in Lemma~\ref{lem:ub_opt}, we show that w.h.p., $J^{\star}_{\cM^+_t} \leq J\ust_{\cM} + C_{ub}~ \diam{t}{\phi_k}$, where $C_{ub}$~is as defined in \eqref{def:Cub}.~As a consequence of the above two results, on a high probability set, a suboptimal policy $\phi$ will never be played from episode $k$ onwards if $\diam{\tau_k}{\phi} \leq C_{ub}\inv \cdot \Delta(\phi)$.~Note that the cumulative regret arising due to policies with $\Delta(\cdot)$ less than $\eps$ is at most $\eps T$. We choose $\eps$ optimally and restrict the analysis to regret arising from playing other policies.
    
    Step 2: We combine Step 1 with Lemma~\ref{lem:gap_phi} in Lemma~\ref{lem:keycell} and show that on a high probability set, in each episode $k$, there is a state $s \in \cS$ such that
    \begin{subequations}
        \begin{align}
            &\diamc{\zeta} \geq   \notag\\
            &\frac{1}{3 C_{ub}}\max\{\gap{s,\phi_k(s)}, C_{ub} \diam{\tau_k}{\phi_k}\}, \label{keycell:cond1}\\
            &\mu\uc{\infty}_{\phi_k,p}(\pi_\cS(\zeta)) \geq \br{\frac{\diam{\tau_k}{\phi_k}}{3}}^{d_\cS + 1},\label{keycell:cond2}
        \end{align}
    \end{subequations}
    where $\zeta = q\inv_{\tau_k}(s,\phi_k(s))$.~This cell $\zeta$ is called a key cell in the $k$-th episode.
    \begin{figure}[t]
        \centering
        \includegraphics[width=0.7\linewidth]{figures/key_cell.pdf}
        \caption{Key cell: The policy $\phi$ is played during the $k$-th episode. This diagram depicts the discretization grid at the beginning of the $k$-th episode.~Then, one of the cells $\zeta_{1,1}$, $\zeta_{2,3}$ and $\zeta_{2,4}$ must be a key cell with a high probability (Lemma~\ref{lem:keycell}). There must be a state $s$ such that $(s,\phi(s))$ belongs to this cell, and $s$ satisfies \eqref{keycell:cond1} and \eqref{keycell:cond2}.}
        \label{fig:keycell}
    \end{figure}
    
    Step 3: We then show that, with high probability, the key cells of the $k$-th episode are visited $\Omega\br{\log{\br{\frac{T}{\delta}}} \diamc{\zeta}^{-(d_\cS + 1)}}$ times during the $k$-th episode.~This is done in Lemma~\ref{lem:lb_num_visit}.

    Step 4: We obtain a bound on the cardinality of the key cells associated with playing policies from the set $\Phi_{2^{-i}} = \{\phi \in \Phi_{SD} \mid \Delta(\phi) \in (2^{-i}, 2^{-i+1}]\}$ by showing that these cells are contained within a set of cells that has a cardinality at most $\cO(2^{id_z})$. We then use this bound along with the lower-bound on the number of plays of the key cells, and conclude that the policies from $\Phi_{2^{-i}}$ are played for a maximum of $\cO\br{\log{\br{\frac{T}{\delta}}} 2^{i(2 d_\cS + d_z + 3)}}$ time-steps~(Lemma~\ref{lem:bdd_Phi_play}).
    
    Step 5: The term~(a) can be written as the sum of the regrets arising due to playing policies from the sets $\Phi_{2^{-i}}$, where $i=1,2,\ldots,\ceil{\log{\br{\frac{1}{\eps}}}}$ and $\eps = T^{-\frac{1}{2 d_\cS + d_z + 3}}$.~To bound the regret arising due to playing policies from $\Phi_{2^{-i}}$, we multiply $\cO\br{\log{\br{\frac{T}{\delta}}} 2^{i(2 d_\cS + d_z + 3)}}$ by $2^{-i + 1}$.~We then sum these regret terms from $i=1$ to $\ceil{\log{\br{\frac{1}{\eps}}}}$ and add $\eps T$.
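
    As a sanity check on Step 5, the summation can be sketched as follows.~The sketch assumes that the logarithm in the summation limit is taken to base $2$, and it writes $C > 0$ for a generic constant hidden in the $\cO(\cdot)$ of Step 4; both are conventions introduced here for illustration.~Let $\deff = 2 d_\cS + d_z + 3$ and $I = \ceil{\log_2{\br{\frac{1}{\eps}}}}$, so that $2^{I} \leq 2 \eps\inv$.~Then,
    \begin{align*}
        \sum_{i=1}^{I}{2^{-i+1} \cdot C \log{\br{\frac{T}{\delta}}} 2^{i \deff}}
        &= 2 C \log{\br{\frac{T}{\delta}}} \sum_{i=1}^{I}{2^{i(\deff - 1)}} \\
        &\leq 4 C \log{\br{\frac{T}{\delta}}} 2^{I(\deff - 1)} \\
        &\leq 2^{\deff + 1} C \log{\br{\frac{T}{\delta}}} \eps^{-(\deff - 1)}.
    \end{align*}
    With $\eps = T^{-\deff\inv}$, we have $\eps^{-(\deff - 1)} = T^{1 - \deff\inv}$ and $\eps T = T^{1 - \deff\inv}$, so both contributions are of order $T^{1-\deff\inv}$, up to the $\log{\br{\frac{T}{\delta}}}$ factor.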
    
    Step 6: Lastly, we add $\sqrt{T}$ to the final bound to compensate for the inaccuracy caused by \evi~due to finite computational resources. This gives us the upper-bound on (a) w.h.p.

    \textbf{Bounding} (b): The upper-bound on the term $(b)$ relies on the uniform ergodicity property~(Assumption~\ref{assum:unif_ergodic}) of $\cM$ and a trick that converts ``Markovian noise'' to ``martingale noise''~\citep{metivier1984applications}.~Proposition~\ref{prop:bddb} shows that on a high probability set, we pay a constant penalty each time we change policy, so that the term (b) is $\cO(K(T) + \sqrt{T})$.~We show that the rule which decides when to start a new episode ensures that $K(T)$ is bounded above by $\cO\big(T^{\frac{d_z + 1}{2 d_\cS + d_z + 3}}\big)$, and so is the term (b).
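
    The comparison of exponents behind this bound can be sketched as follows, assuming only that $d_\cS \geq 1$, $d_z \geq 0$ and $T \geq 1$.~With $\deff = 2 d_\cS + d_z + 3$, we have
    \begin{align*}
        \frac{d_z + 1}{\deff} \leq \frac{2 d_\cS + d_z + 2}{\deff} = 1 - \deff\inv, \qquad \frac{1}{2} \leq 1 - \deff\inv,
    \end{align*}
    where the second inequality holds because $\deff \geq 2$.~Hence both $K(T)$ and $\sqrt{T}$ are $\cO\big(T^{1-\deff\inv}\big)$, so the contribution of (b) does not exceed the order of the bound on (a).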

    Summing the upper-bounds on (a) and (b), we obtain the desired regret bound.
\end{proof}