\section{Introduction}\label{sec:intro}
Reinforcement Learning (RL)~\citep{sutton2018reinforcement} is a popular model for systems involving real-time sequential decision-making and has applications in many fields such as robotics and natural language processing~\citep{ibarz2021train,sodhi2023effectiveness}.~An agent interacts sequentially with an environment by applying actions and gathering rewards.~The environment is modeled as a Markov decision process (MDP)~\citep{puterman2014markov} whose transition probabilities are not known to the agent.~The agent's goal is to choose actions sequentially so as to maximize the cumulative reward.


The current work develops an RL algorithm for infinite-horizon average reward Lipschitz MDPs on metric spaces.~Popular frameworks such as tabular and linear MDPs, which have been studied in detail in the RL literature, are not suitable for real-world applications since these typically involve nonlinear systems that reside on continuous spaces~\citep{kumar2021rma}.~For continuous spaces,~the learning regret could grow linearly with the time horizon $T$ unless the problem has some structure~\citep{kleinberg2008multi}.~Hence, we focus on Lipschitz MDPs, which form a very general class that subsumes several popular classes such as linear MDPs~\citep{jin2020provably}, RKHS MDPs~\citep{chowdhury2019online}, linear mixture models, RKHS approximation, and the nonlinear function approximation framework~\citep{osband2014model,kakade2020information}.~See~\cite{maran2024no,maran2024projection} for more details.

Throughout, we use $d_{\cS},d_{\cA}$ to denote the dimensions of the state-space and the action-space, respectively, and $d:=d_{\cS}+d_{\cA}$.~In episodic RL for Lipschitz MDPs, the regret is known to scale as $\ctO\big(K^{1 - \deff\inv}\big)$\footnote{$\ctO$ suppresses poly-logarithmic dependence in $K$ or $T$.}, where $K$ is the number of episodes, while $\deff$ is the effective dimension associated with the \textit{underlying MDP} and also, importantly, the \textit{algorithm}.~A naive algorithm that uses a fixed discretization has $\deff=d+2$~\citep{song2019efficient}.~One can use problem structure to reduce $\deff$; prior works on episodic Lipschitz MDPs such as~\citet{sinclair2019adaptive, cao2020provably} reduce the effective dimension to $d_z+2$, where the zooming dimension $d_z$ measures the size of the set of near-optimal state-action pairs.~These gains are achieved by performing an adaptive discretization of the state-action space and ``zooming in'' on only the promising regions of the state-action space by creating a finer grid around these as time progresses.~However,~\cite{kar2024adaptive} show that the zooming technique and algorithms developed for episodic MDPs are inappropriate for average reward RL tasks, in that $d_z\to d$ as $T\to\infty$, which is what one would have obtained via a naive fixed discretization scheme.~\cite{kar2024adaptive} derive an $\cO(\eps^{-(2 d_\cS + d^\eps_z + 1)} \log{T})$ upper-bound on the regret with respect to an $\eps$-suboptimal comparator policy class, where $d^\eps_z$ is the ``$\eps$-zooming dimension'' and satisfies $d^\eps_z \leq d$.~However, $d_z^\eps \to d$ in the limit $\eps \downarrow 0$, which shows that no adaptivity gains are achieved if the policy class contains an optimal policy, i.e., if one wants to attain optimal performance.~In a later version of the same paper,~\cite{kar2024policy} rectify this issue to some extent by competing against an optimal policy class. 
They work directly in the policy space, and show zooming behavior in this space rather than in the state-action space, i.e., their algorithm ``activates'' more policies from the near-optimal regions of the policy space.~They obtain $\deff = d^\Phi_z + 2$, where $d^\Phi_z$ measures the size of the set of near-optimal policies within the set $\Phi$ of policies that can be chosen; more precisely, $d^\Phi_z$ is the log-covering number of the set consisting of $(\beta, 2\beta]$-suboptimal policies in $\Phi$.~However, $d^\Phi_z$ can be prohibitively large if either the MDP or the policy-set $\Phi$ is not structured, since it involves coverings in function spaces~\citep{guntuboyina2012l1}.~The current work remedies this and upper-bounds the regret in terms of an alternative notion of zooming dimension, one that is at most $d$ in the worst case.~Though the analysis of our algorithm is performed in the policy space, it relates the suboptimality of a policy with that of the associated state-action pairs, thereby deriving an upper-bound on the number of plays of suboptimal policies in terms of coverings of the state-action space.
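
To illustrate how $\deff$ enters the episodic rate $\ctO\big(K^{1 - \deff\inv}\big)$, consider a hypothetical instance with $d = 2$ and $d_z = 1$; these values are assumptions chosen only to make the arithmetic concrete.
\[
\mbox{fixed discretization: } \deff = d + 2 = 4 \;\Rightarrow\; \ctO\big(K^{3/4}\big),
\qquad
\mbox{zooming: } \deff = d_z + 2 = 3 \;\Rightarrow\; \ctO\big(K^{2/3}\big).
\]
When $d_z = d$, the two exponents coincide, which is the sense in which zooming yields no gain in that case.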

\subsection{Contributions}
\label{subsec:contribution}
We propose a computationally efficient algorithm~\algo~for Lipschitz MDPs in the infinite-horizon average reward RL setup.~\algo~combines adaptive discretization with the principle of optimism and yields zooming behavior.~We provide a regret upper-bound for~\algo~as a function of the zooming dimension $d_z$, where $d_z$ is defined in terms of the suboptimality gap of the state-action pairs~\eqref{def:subgap}.~We show that the regret of~\algo~is upper-bounded as $\ctO\big(T^{1 - \deff\inv}\big)$, where $\deff = 2 d_\cS + d_z + 3$ and $d_z \le d$.~Attaining a low $\deff$ required overcoming several challenges, which are discussed in detail below.
\begin{enumerate}
    \item \textit{Bypassing Policy Covers}:~As discussed above, working with policy coverings could lead to a large $\deff$.~Let $\Phi\uc{\beta}$ denote the set of all $(\beta, 2\beta]$-suboptimal policies.~By establishing an upper-bound on the total number of plays of policies from $\Phi\uc{\beta}$ in terms of the $\beta$-covering number of the set of all $\beta$-suboptimal state-action pairs, the current work attains a small $\deff$.~Our proof hinges on the existence of certain ``key cells.''~More specifically, we show that whenever~\algo~plays a suboptimal policy $\phi$, there exists a ball in the state-action space that satisfies the following two properties:
    (i) it has not been visited sufficiently many times, and (ii) the stationary measure under $\phi$ assigns a large probability mass to it.~Such a ball is called a ``key cell'' for that particular episode; see Fig.~\ref{fig:keycell}.~Lemma~\ref{lem:gap_phi} establishes a relation between the suboptimality of a policy and the suboptimality gaps of the state-action pairs through which this policy passes.~This result plays a crucial role in proving the existence of key cells.~We derive an upper-bound on the number of plays of a cell during which it is a key cell and policies from $\Phi\uc{\beta}$ are played; here $\beta$ can be chosen from $(0,1]$.~This upper-bound helps us express the regret in terms of a covering of the state-action space, which yields a bound that depends upon the zooming dimension~\eqref{def:zoomingdim}.
    
    \item \textit{Adaptive Episode Durations}:~In order to attain $\deff = 2 d_{\cS} + d_z + 3$, we have to ensure that, with high probability, the key cells are visited at least a certain number of times in each episode.~This is achieved by choosing the episode durations as a function of the ``proxy diameter'' of the policy that is currently being played.~We note that popular approaches for choosing episode durations, such as ending the episode upon doubling the number of visits to any cell, would fail to yield $\deff = 2 d_{\cS} + d_z + 3$.
\end{enumerate}
We verify the gains of \algo~over both popular fixed discretization-based algorithms and existing adaptive discretization-based algorithms through simulation experiments.
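
For intuition, the following back-of-the-envelope calculation instantiates the exponent $1 - \deff\inv$ with $\deff = 2 d_\cS + d_z + 3$.~The dimensions $d_\cS = d_\cA = 1$ (so that $d = 2$) and the two values of $d_z$ are hypothetical choices made purely for illustration; they are not taken from any experiment.
\[
\begin{array}{lll}
d_z = 2 \;(= d,\ \mbox{worst case}): & \deff = 2\cdot 1 + 2 + 3 = 7, & \mbox{regret } \ctO\big(T^{6/7}\big),\\[2pt]
d_z = 1: & \deff = 2\cdot 1 + 1 + 3 = 6, & \mbox{regret } \ctO\big(T^{5/6}\big).
\end{array}
\]
Thus, in this hypothetical instance, every unit by which $d_z$ falls below $d$ lowers $\deff$ by one and improves the exponent accordingly.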
\subsection{Past Works}
\textit{\underline{Lipschitz Bandits}}:~The idea of zooming was first proposed by~\citet{kleinberg2008multi} for Lipschitz multi-armed bandits.~\citet{bubeck2011x} proposed a similar idea that uses a hierarchical partition of the arm space to perform adaptive discretization.

\textit{\underline{Lipschitz MDPs}}:~\citet{domingues2021kernel} use smoothing kernels in order to construct model estimates and obtain $\ctO\Big(H^3 K^{1 - (2d+1)\inv}\Big)$ regret, where $H$ is the horizon of each episode.~Provable gains arising from adaptive discretization and zooming were first demonstrated by~\citet{cao2020provably}.~They obtain $\ctO\Big(H^{2.5+(2 d_z + 4)\inv} K^{1 - (2d_z+1)\inv}\Big)$ regret, where $d_z$ is the zooming dimension defined specifically for episodic RL.~In another work,~\citet{sinclair2023adaptive} propose a model-based algorithm with adaptive discretization and show the regret to be upper-bounded as $\ctO\Big(L_v H^\frac{3}{2} K^{1 - (d_z + d_\cS)\inv}\Big)$, where $L_v$ is the Lipschitz constant of the value function.~Compared with works based on general function approximation, the regret bounds obtained in works on Lipschitz MDPs have a worse growth rate as a function of the time horizon.~However, this is expected since Lipschitz MDPs are a more general class of MDPs and have a regret lower-bound of $\Omega(K^{1 - (d_z+2)\inv})$~\citep{sinclair2023adaptive}.

\textit{\underline{Non-episodic RL}}:~The minimax regret of state-of-the-art algorithms for finite MDPs~\citep{jaksch2010near,tossou2019near} with $S$ states and $A$ actions is bounded as $\ctO(\sqrt{DSAT})$, where $D$ is the diameter of the MDP.~For finite MDPs in which the transition kernel is a mixture of $d$ component transition kernels, the regret is upper-bounded as $\ctO(d \sqrt{D T})$~\citep{wu2022nearly}.~The current work develops an algorithm for continuous MDPs.~\cite{wei2021learning} analyze continuous MDPs under the assumption that the relative value function is a linear function of the features, and obtain $\ctO(\sqrt{T})$ regret.~In another work,~\citet{he2023sample} approximate the MDP, as well as the value function, by using general function classes.~They derive a regret upper-bound of $\ctO(\textit{poly}(d_E, B) \sqrt{d_F T})$, where $B$ is the span of the relative value function, and $d_E,d_F$ are the eluder dimension and the log-covering number of the function class, respectively.~When the underlying continuous MDP has an $\alpha$-H\"older continuous and infinitely often smoothly differentiable transition kernel,~\citet{ortner2012online} show how to obtain $\ctO\Big(T^\frac{2d + \alpha}{2d + 2 \alpha}\Big)$ regret.~To the best of our knowledge, only~\citep{kar2024adaptive,kar2024policy}\footnote{\cite{kar2024policy} is a later version of the same paper as~\cite{kar2024adaptive}.} have studied adaptive discretization for average reward Lipschitz MDPs; however, they analyze regret with respect to a given class of policies.~In~\citep{kar2024adaptive}, when this class is ``sufficiently rich'' so that it contains an optimal policy, their algorithm does not exhibit adaptivity gains, i.e., their zooming dimension reduces to $d$, which is what one would attain via a fixed discretization scheme.~In~\cite{kar2024policy}, the zooming dimension could be even larger than $d$ if the policy class is complex.
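
As a side calculation (a rewriting of the exponent quoted above, not an additional result), the rate of~\citet{ortner2012online} can be put in the $1 - \deff\inv$ form used elsewhere in this section.~Here $d$ and $\alpha$ are read exactly as in the quoted bound, and the choice $\alpha = 1$ is an illustrative assumption corresponding to a Lipschitz (rather than merely H\"older) kernel:
\[
\frac{2d + \alpha}{2d + 2\alpha} \;=\; 1 - \frac{\alpha}{2d + 2\alpha},
\qquad\mbox{so that for } \alpha = 1:\quad
\frac{2d + 1}{2d + 2} \;=\; 1 - (2d+2)\inv .
\]
Under this reading, the quoted rate would correspond to an exponent of the form $1 - \deff\inv$ with $\deff = 2d + 2$.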
