\subsection{Comparison with the Literature}

In this subsection, we compare the constant regret result of Theorem \ref{thm:optimal_policy_identification} with related results from the literature. In Table \ref{tab:critical_episodes} we provide an overview of algorithms achieving constant regret in different learning settings and compare their critical episodes; that is, the episode after which, with high probability, the respective algorithm does not incur additional regret.

\begin{table*}
    \caption{Comparison of critical episodes; for ease of comparison, constants that refer to eigenvalue sizes are denoted collectively by $\lambda^{\star}$.}
    \label{tab:critical_episodes}
\centering
\begin{tabular}{cccc}
 \toprule
 Algorithm & Setting & Features $\phi$ & Critical Episode \\\midrule
 %
 LEADER \citep{papini2021leveraging} & CLB & Known & \(\tilde{O}((\frac{d}{\lambda^{\star}\Delta_{\textnormal{min}}})^{2})\)
 \\ [2ex]
 %
 BanditSRL \citep{tirinzoni2022scalable} & CLB & Unknown & \(\tilde{O}(\frac{d^{2}}{(\lambda^{\star}\Delta_{\textnormal{min}})^{2}\epsilon_{\textnormal{min}}})\) \\ [2ex] 
 %
 LSVI-LEADER \citep{papini2021reinforcement} & Linear MDP & Known & \(\tilde{O}(\max\{\frac{d^{3}H^{4}}{(\lambda^{\star})^{2}}, \frac{d^{2}H^{4}}{\Delta_{\textnormal{min}}^{2}(\lambda^{\star})^{3}}\})\) \\ [2ex] 
 %
 \color{blue} \alg+ (this work) & \color{blue} 
 Low-rank MDP & \color{blue} Unknown & \color{blue} 
 \(\tilde{O}(\frac{H^{12}d^{8}|\mathcal{A}|^{4}}{(\Delta_{\textnormal{min}}d_{\textnormal{min}}^{\star})^{8}(\alpha\lambda^{\star})^{4}})\)\\
\bottomrule
\end{tabular}
\end{table*}

\paragraph{LSVI-LEADER \citep{papini2021reinforcement}}
In the linear MDP setting, the LSVI-LEADER algorithm proposed by \cite{papini2021reinforcement} assumes access to a set of realizable representations containing one UniSOFT representation and assumes that the unique optimal policy assumption (Assumption \ref{ass:unique_optimal_policy}) holds. However, their algorithm does not scale to large function spaces, as it learns a different representation for each state-action pair.

In comparison, \alg+ can deal with large function spaces and misspecified representations. Additionally, we show how to generalize our regret bounds beyond the unique optimal policy assumption. However, we assume access to an optimization oracle, positive minimal optimal occupancy and known instance-dependent quantities.

In Table \ref{tab:critical_episodes} we can see that, in contrast to LSVI-LEADER, the critical episode of \alg+ depends on the size of the action space, which seems to be unavoidable in low-rank MDPs \citep{zhao2024learning}. We additionally incur a dependence on \(d_{\textnormal{min}}^{\star}\), which stems from bounding average sub-optimality gaps, and on $\alpha$, as we must select representations with low model error. The smaller polynomial dependence of LSVI-LEADER follows from the tighter regret bounds available for linear MDPs.
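
As a rough illustration of the polynomial dependencies in Table \ref{tab:critical_episodes}, consider halving $\Delta_{\textnormal{min}}$ while holding all other quantities fixed and ignoring the logarithmic factors hidden by $\tilde{O}$. Reading the exponents off the table, the bound for \alg+ grows by the first factor below, whereas the bounds for LEADER and BanditSRL grow by the second factor, and the bound for LSVI-LEADER grows by at most the second factor:
\begin{align*}
    \left(\tfrac{1}{2}\right)^{-8} = 2^{8} = 256
    \qquad\textnormal{versus}\qquad
    \left(\tfrac{1}{2}\right)^{-2} = 2^{2} = 4.
\end{align*}
This compares upper bounds only and should not be read as a statement about the actual behavior of the algorithms.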

\paragraph{BanditSRL \citep{tirinzoni2022scalable}}
In contextual linear bandits (CLB), the feature map $\phi$ need only linearly represent the reward function. Similarly to our work, BanditSRL learns a non-redundant representation with good spectral properties over a known finite function space. It does not rely on any oracle assumptions, as the reward function can be estimated efficiently by minimizing the mean squared error (MSE). However, it relies on a restrictive misspecification assumption that allows it to eliminate all point-wise misspecified representations. In particular, \cite{tirinzoni2022scalable} assume that the following quantity is well-defined:
\begin{align*}
    \epsilon_{\textnormal{min}}&:=\min_{\phi\in\Phi\setminus\Phi^{\star}}\min_{\theta:\Vert\theta\Vert\leq 1}\min_{\pi:\mathcal{S}\to\mathcal{A}} \\
    &\mathbb{E}_{s\sim d_{1}}[(\langle\phi(s, \pi(s)), \theta\rangle - r^{\star}(s, \pi(s)))^{2}] > 0.
\end{align*}
Although estimating the reward function is conceptually different from our setting, we emphasize that our algorithm can deal with misspecified representations without making additional assumptions on the level of misspecification.

\paragraph{Constant Regret with Misspecified Representations}

To the best of our knowledge, no algorithm for linear MDPs can identify optimal behavior when the features are only required to have small misspecification error on average. In fact, only very recently, \cite{agarwal2023provable} provided the first sublinear regret result in this setting. On the other hand, \cite{zhang2024achieving} provided an algorithm that achieves constant instance-dependent regret for linear MDPs with features whose point-wise misspecification is small relative to the minimal sub-optimality gap.

\subsection{Limitations}

\paragraph{Redundant Features}

Following an analysis similar to that of \cite{papini2021reinforcement}, our regret bounds would also hold for redundant UniSOFT feature maps, provided that we are guaranteed to select them. In order to learn possibly redundant UniSOFT feature maps, \cite{tirinzoni2022scalable} provided the following loss function:
\begin{align*}
    \min_{(s,a)\in\mathcal{D}}\phi(s,a)^{T}\left(\sum_{(s',a')\in\mathcal{D}}\phi(s',a')\phi(s',a')^{T}\right)\phi(s,a).
\end{align*}
However, this loss function selects UniSOFT feature maps only if all state-action pairs are visited in finite time; otherwise, we cannot ensure that the features of optimal actions span the observable feature space.

\paragraph{Low-Rank Assumption}
The set of MDPs that admit a low-rank representation whose rank $d$ is small relative to $|\mathcal{S}|$ is inherently limited. In particular, \cite{leedemystifying} showed that the feature dimension is lower bounded by \(\lfloor\frac{|\mathcal{S}|}{U}\rfloor\), where \(U:=\max_{(s,a)\in\mathcal{S}\times\mathcal{A}}|\{s'\in\mathcal{S}:\mathcal{P}(s'| s, a) > 0\}|\) is the maximum number of directly reachable states. An immediate consequence is that, in deterministic environments, where \(U=1\), \(d=|\mathcal{S}|\) holds. We refer to Section 4 in \cite{leedemystifying} for a more thorough discussion.
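
For a sense of scale, consider a purely hypothetical instance (the numbers are illustrative and not taken from \cite{leedemystifying}) with $|\mathcal{S}|=1000$ states in which at most $U=10$ states are directly reachable from any state-action pair. The lower bound then reads
\begin{align*}
    d \geq \left\lfloor\frac{|\mathcal{S}|}{U}\right\rfloor = \left\lfloor\frac{1000}{10}\right\rfloor = 100,
\end{align*}
whereas a deterministic instance of the same size has $U=1$ and therefore $d\geq 1000$.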

\paragraph{Minimal Optimal Occupancy} In contrast to existing work on constant regret for linear MDPs \citep{papini2021reinforcement, zhang2023provably, zhang2024achieving}, our bound has an additional dependence on $d_{\textnormal{min}}^{\star}$. This dependence is caused by controlling expected sub-optimality gaps. A point-wise uncertainty quantification is generally not possible, since the MLE objective is unbounded and standard
uniform convergence techniques therefore do not apply. Nevertheless, $d_{\textnormal{min}}^{\star}\approx\lambda^{\star}$ is generally a reasonable approximation, and quantities similar to $\lambda^{\star}$ appear in many existing works (e.g., see Table \ref{tab:critical_episodes}) that leverage representations with good spectral properties.

We also note an inherent and undesirable trade-off between $d_{\textnormal{min}}^{\star}$ and $d$: highly random transitions are needed to hope for a small rank $d$, whereas deterministic transitions are needed for a large value of $d_{\textnormal{min}}^{\star}$.

\paragraph{Computation}

Computationally, Algorithm \ref{alg:UniSREP} suffers from limitations similar to those of other existing works on low-rank MDPs. In particular, the optimization oracle cannot be solved both efficiently and accurately, as there is no practical mechanism to guarantee the normalization conditions for \(\phi\) and \(\mu\) \citep{zhang2022making}. This makes the constrained optimization objective in line \ref{alg:oracle} of the algorithm intractable. However, the MLE objective can be approximated with noise contrastive estimation (NCE) \citep{zhang2022making}, with the UniSOFT loss added as a regularization term.