\paragraph{Linear MDPs}

\cite{jin2020provably} proposed the first sample-efficient algorithm for linear MDPs that neither assumes access to a generative model nor imposes other restrictive assumptions on the transition operator. Their algorithm, \mbox{\textsc{LSVI-UCB}}, combines classical LSVI with UCB-style bonuses and achieves \(\tilde{O}(\sqrt{T})\) worst-case regret. Later, \cite{he2021logarithmic} provided the first instance-dependent regret analysis for linear MDPs, achieving a logarithmic \(O(\Delta_{\textnormal{min}}^{-1}\log(T))\) instance-dependent regret bound. Using features that satisfy a diversity condition called UniSOFT (Definition \ref{def:unisoft}), \cite{papini2021reinforcement} showed that \textsc{LSVI-UCB} enjoys constant instance-dependent regret. In addition, they demonstrated that the UniSOFT property is necessary for constant expected regret, reinforcing the importance of good features. Using a similar diversity condition in bilinear MDPs, \cite{zhang2023provably} provided an algorithm that enjoys constant instance-dependent regret. However, neither method scales to large function classes or misspecified representations. Recently, \cite{zhang2024achieving} provided an algorithm that achieves constant regret without prior assumptions on the features. Remarkably, their result holds even under misspecified features, provided that the point-wise misspecification is small relative to the minimal sub-optimality gap.

\paragraph{Low-Rank MDPs}
In the much more challenging low-rank MDP setting, the seminal work of \cite{agarwal2020flambe} provided the first reward-free, oracle-efficient algorithm, called \textsc{FLAMBE}. They proposed learning representations using maximum likelihood estimation (MLE) and showed that their explore-then-commit style algorithm achieves polynomial sample complexity when provided with an MLE oracle. By interleaving representation learning, exploration, and exploitation, \cite{uehara2021representation} provided an algorithm called \textsc{REP-UCB} that improves the sample complexity bound of \textsc{FLAMBE} in every relevant variable under the same MLE oracle assumptions. In particular, they employed a UCB-style bonus term, which provides optimism at the initial state distribution. Recently, \cite{cheng2023improved} showed that this bonus term can also serve as a trajectory-wise uncertainty measure. They leveraged this insight to design a value function that encourages exploration in regions of the state-action space where the uncertainty in the model estimate is large, and subsequently provided an improved sample complexity bound. Finally, \cite{zhao2024learning} provided the first regret bound for low-rank MDPs, employing a double exploration strategy. However, to the best of our knowledge, and in contrast to linear MDPs, there exists no instance-dependent regret bound for low-rank MDPs. Furthermore, the conditions under which constant regret is achievable remain an open problem.

\paragraph{Contextual Linear Bandits}
In contextual linear bandits (CLB), \cite{papini2021leveraging} showed that a diversity condition called HLS \citep{hao2020adaptive}, similar to the UniSOFT property, is necessary and sufficient for constant instance-dependent regret. Relaxing the assumption of exact feature maps, \cite{tirinzoni2022scalable} provided an algorithm that achieves constant regret by introducing a constrained optimization objective that encourages the HLS property and constrains the representations to be exact.