%is an effective tool to tackle large state spaces, allowing generalization between states and enjoying
The design of algorithms for RL problems involving large state spaces has been of great interest in recent years. As traditional tabular methods are intractable in this setting, algorithms that use function approximation to generalize across states have gained substantial attention. In particular, non-linear function approximation has demonstrated strong empirical success \citep{he2024adaptive,zhang2022making}, and provably efficient algorithms have begun to emerge \citep{agarwal2020flambe, uehara2021representation, modi2024model}.
 
 Furthermore, in many RL applications, a good RL algorithm is expected to gather enough information to identify optimal behavior in finite time \citep{zhang2024achieving}. A key question is therefore under which assumptions this expectation can be established theoretically. %In particular, the potential benefits of employing representation learning for more efficient exploration remain underexplored. 
 %In particular, the question of how function approximation can be used to provably identify optimal behavior remains underexplored.

Recently, \citet{jin2020provably} have shown that sample-efficient learning in large state-action spaces is possible in linear Markov decision processes (MDPs), where the transition operator \(\mathcal{P}\) admits a low-rank decomposition \(\mathcal{P}(s'|s,a)=\langle\phi(s,a),\mu(s')\rangle\) into (known) features \(\phi\) and (unknown) signed measures \(\mu\). In this setting, \citet{papini2021reinforcement} showed that features that fulfill a spectral property called UniSOFT (see Definition~\ref{def:unisoft}) are necessary and sufficient for constant instance-dependent regret, i.e., the regret does not scale with the number of iterations.
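To make the notion of constant regret concrete, the following display sketches it in standard episodic notation. The symbols \(K\), \(\pi_k\), \(V_1^{\pi}\), \(s_1^k\), and \(C\) are assumptions introduced only for this illustration and need not match the notation used in later sections. Roughly, an algorithm has constant instance-dependent regret if
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{\star}(s_1^k) - V_1^{\pi_k}(s_1^k) \Bigr) \;\le\; C \qquad \mbox{for all } K \ge 1,
\]
where \(\pi_k\) denotes the policy played in episode \(k\), \(s_1^k\) the initial state of that episode, and \(C\) a quantity that may depend on the instance (for example, on the sub-optimality gaps) but not on \(K\). Such guarantees are typically stated with high probability or in expectation.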

Similarly, in contextual linear bandits (CLBs), where the reward function is linear in the features \(\phi\), \citet{papini2021leveraging} showed that a diversity condition called HLS \citep{hao2020adaptive} is necessary and sufficient for constant instance-dependent regret. \citet{tirinzoni2022scalable} provided an algorithm that achieves constant instance-dependent regret for CLBs even when the true features \(\phi\) are unknown and must be learned over some (known) finite function class.

To the best of our knowledge, there exists neither an instance-dependent result nor an algorithm that achieves constant regret for low-rank MDPs~\citep{agarwal2020flambe}; that is, linear MDPs with unknown features \(\phi\). 

In this work, we study low-rank MDPs and aim to close this gap by addressing the following research question.

\textbf{Can we achieve constant instance-dependent regret in low-rank MDPs?}
% Which assumptions are sufficient therefore?
% Is it possible to define an oracle-efficient RL algorithm that enjoys constant instance-dependent regret in low-rank MDPs?}

As we shall see, we can answer this question positively. We provide an instance-dependent analysis of our proposed algorithm \alg, an augmented version of the recently proposed \textsc{REP-UCB} algorithm \citep{uehara2021representation}, which serves as the basis for many other works on low-rank MDPs \citep{zhang2022efficient, agarwal2023provable, zhao2024learning}. Our analysis leverages the insights of \citet{cheng2023improved}, who designed a UCB-style bonus term that serves as a trajectory-wise uncertainty measure. In particular, we show that this bonus term is an almost optimistic estimate of the average sub-optimality gaps. This allows us to perform an instance-dependent regret analysis, similar to that of \citet{papini2021reinforcement}, employing UniSOFT feature maps. More specifically, we make the following contributions:

\begin{itemize}

    \item We provide an algorithm called \alg (Algorithm~\ref{alg:UniSREP}) that, for $T$ large enough, achieves $\tilde{O}(\sqrt{T})$ expected regret (Theorem~\ref{thm:instance_dependent_regret_bound_with_unisoft_and_constant_pseudo_regret}), provided that the minimal sub-optimality gap (Definition~\ref{ass:sub_optimality_gap_exists}) and the minimal optimal occupancy (Definition~\ref{ass:min_optimal_occupancy_exists}) are well-defined and that we have access to a sufficiently expressive function space (Assumption~\ref{ass:expressivness});
    
    \item We design a termination criterion that allows \alg to achieve constant regret (Theorem~\ref{thm:optimal_policy_identification}), provided that the minimal sub-optimality gap and the minimal optimal occupancy are known;
    
    \item We demonstrate that the existence of UniSOFT representations is fully characterized by the RL instance (Lemma~\ref{lemma:unisoft_existance}). In particular, we show that in low-rank MDPs, feature space coverage is equivalent to state space coverage, a result that may be of independent interest.
\end{itemize}