This section provides an algorithm, called \alg (Upper Confidence Driven Universally Spanning Representation Learning, Algorithm \ref{alg:UniSREP}), that achieves sub-linear expected regret under an additional simplifying assumption that guarantees the selection of good representations. Furthermore, we demonstrate that by introducing a carefully chosen termination criterion to \alg, resulting in the algorithm \alg \textcolor{blue}{+} (Algorithm \ref{alg:UniSREP} with modifications shown in blue), we can identify optimal behavior with high probability whenever the minimal sub-optimality gap and the minimal optimal occupancy are known.

\input{uai2025-template/Algorithm}

\subsection{Algorithm}
At a high level, \alg is a finite-horizon adaptation of the \textsc{REP-UCB} algorithm proposed by \cite{uehara2021representation}. However, unlike \textsc{REP-UCB}, we employ a double exploration scheme to enable a regret bound, as proposed by \cite{zhao2024learning}, and we augment the representation learning objective to encourage feature maps with good spectral properties.

\paragraph{Exploration (Lines \ref{alg:exploration_start}-\ref{alg:exploration_end})}
For each time step $h$, the algorithm samples from the state-occupancy distribution \(d_{\mathcal{P}^{\star}, h-1}^{\pi_{t-1}}\) and continues based on the outcome of a Bernoulli experiment with success rate $1-\xi_{t}$. If the experiment succeeds, the algorithm explores with the behavior policy $\pi_{t-1}$; otherwise, it explores by taking actions uniformly at random. This mechanism is key to enabling a regret bound: without it, the algorithm would explore uniformly at random in each episode and time step, which prevents sub-linear regret. After time step $h+1$, the algorithm rolls out to time step $H$ according to $\pi_{t-1}$. Note that we require the algorithm to interact with the environment in full trajectories only because of a technicality when bounding the regret; qualitatively, the algorithm does not change if it resets after $h+1$ time steps. Finally, we collect the transitions of the time steps $h-1$ and $h$ in separate datasets.
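
To make the role of \(\xi_{t}\) concrete, the following sketch writes out the marginal distribution of an exploratory action. It is an illustration only: the symbol \(\tilde{\pi}_{t}\) is introduced here and does not appear in Algorithm \ref{alg:UniSREP}, and we assume a finite action set. Marginalizing over the Bernoulli experiment gives
\[
    \tilde{\pi}_{t}(a|s) = (1-\xi_{t})\,\pi_{t-1}(a|s) + \frac{\xi_{t}}{|\mathcal{A}|},
    \qquad\text{so that}\qquad
    \tilde{\pi}_{t}(a|s) \geq \frac{\xi_{t}}{|\mathcal{A}|}
    \quad\text{for all } (s,a)\in\mathcal{S}\times\mathcal{A}.
\]
Hence every action retains probability at least \(\xi_{t}/|\mathcal{A}|\), while the exploratory action is drawn from \(\pi_{t-1}\) with probability \(1-\xi_{t}\).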

\paragraph{Representation Learning (Lines \ref{alg:replearn_start}-\ref{alg:replearn_end})}
Similarly to \cite{tirinzoni2022scalable}, we employ a constrained optimization objective (Line \ref{alg:oracle}) to learn features that have good spectral properties and approximate the transition operator sufficiently well. We define the following objective functions:
\begin{align}
    &\mathcal{L}^{\textnormal{MLE}}(\phi_{h}, \mu_{h}, \mathcal{D}) = \sum_{(s,a,s')\in\mathcal{D}}\log\left(\langle \phi_{h}(s,a), \mu_{h}(s')\rangle\right) \label{eq:likelihood_loss}\\
    &\mathcal{L}^{\textnormal{UniSOFT}}(\phi_{h}, \mathcal{D}) = -\lambda_{\textnormal{min}}\left(\sum_{(s,a)\in\mathcal{D}}\phi_{h}(s,a)\phi_{h}(s,a)^{T}\right) \label{eq:unisoft_loss}
\end{align}
Then, the set of representations that attain the maximum likelihood when fitting the transition operator on a dataset \(\mathcal{D}\) is defined as follows:
\begin{align*}
    \Phi_{h}^{\textnormal{MLE}}(\mathcal{D}) =& \{
            \phi\in\Phi_{h}: \max_{\mu\in\Psi_{h}}\mathcal{L}^{\textnormal{MLE}}(\phi, \mu, \mathcal{D}) \\
            &\qquad= \max_{(\phi',\mu')\in\Phi_{h}\times\Psi_{h}} \mathcal{L}^{\textnormal{MLE}}(\phi', \mu', \mathcal{D}) 
        \}\label{eq:MLE_constraint}
\end{align*}

Similarly to previous work on low-rank MDPs \citep{agarwal2020flambe, uehara2021representation, cheng2023improved}, as a computational abstraction, we assume access to an optimization oracle.
\begin{definition}\label{def:opt_oracle}(Optimization Oracle)
    Consider the function class \(\Phi\times\Psi\) and datasets \(\mathcal{D}\) and \(\mathcal{D}'\) consisting of \((s,a)\) tuples and \((s,a,s')\) triples, respectively. Then, the \emph{optimization oracle} returns for any \(h\in[H]\),
    \[
        \arg\min_{\phi\in\Phi_{h}^{\textnormal{MLE}}(\mathcal{D}')}\mathcal{L}^{\textnormal{UniSOFT}}(\phi, \mathcal{D}).
    \]
\end{definition}
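
As a hypothetical illustration of how the oracle breaks ties among maximum-likelihood solutions (the dimension, dataset and feature maps below are assumptions made for this example and are not taken from the analysis), let \(d=2\) and \(\mathcal{D}=\{(s_{1},a_{1}),(s_{2},a_{2})\}\), and suppose that two feature maps \(\phi,\phi'\in\Phi_{h}^{\textnormal{MLE}}(\mathcal{D}')\) attain the same likelihood. Let \(e_{1},e_{2}\) denote the standard basis vectors of \(\mathbb{R}^{2}\). If \(\phi(s_{1},a_{1})=\phi(s_{2},a_{2})=e_{1}\), whereas \(\phi'(s_{1},a_{1})=e_{1}\) and \(\phi'(s_{2},a_{2})=e_{2}\), then
\begin{align*}
    \mathcal{L}^{\textnormal{UniSOFT}}(\phi, \mathcal{D}) &= -\lambda_{\textnormal{min}}\begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} = 0, \\
    \mathcal{L}^{\textnormal{UniSOFT}}(\phi', \mathcal{D}) &= -\lambda_{\textnormal{min}}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = -1.
\end{align*}
In this toy instance the oracle returns \(\phi'\), whose features span \(\mathbb{R}^{2}\) on \(\mathcal{D}\), rather than the rank-deficient \(\phi\).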

Note that although the oracle is computationally intractable, it can be approximated reasonably well in practice~\citep{tirinzoni2022scalable, zhang2022making}. After invoking the oracle, we use the learned features to define a UCB-style bonus term and the estimated transition operator.

\paragraph{Planning (Line \ref{alg:planning})}
We find an optimal (deterministic) policy for the bonus-augmented reward function in the estimated environment. Here, we assume access to a planning procedure that returns, for any given reward function \(r\) and transition operator \(\mathcal{P}=\langle\phi, \mu\rangle\), an optimal (deterministic) policy \(\arg\max_{\pi\in\Pi}V_{\mathcal{P}, r, 1}^{\pi, d_{1}}\). We note that planning in a known linear MDP can be performed efficiently, for example, with LSVI-UCB \citep{jin2020provably}.

\subsection{Analysis}\label{sec:instance-dependent_regret_bounds}

In the following lemma, we provide a baseline worst-case regret bound for \alg, which does not utilize UniSOFT features. We denote the regret incurred by Algorithm \ref{alg:UniSREP} as \(\tilde{\mathcal{R}}\), which differs from the regret incurred by the behavior policies \(\{\pi_{t}\}_{t=1}^{T}\), denoted as \(\mathcal{R}\).

\begin{restatable}[Expected Regret without UniSOFT]{lemma}{regretwithoutunisoftreps}\label{lemma:sublinear_expected_regret_without_unisoft}
    Let \(\xi_{t}=t^{-1/4}\). Suppose Assumption \ref{ass:realizability} (realizability) holds. Then, for any \(T\in\mathbb{N}\), \alg (Algorithm \ref{alg:UniSREP}) satisfies
    \[
        \mathbb{E}[\tilde{\mathcal{R}}(T)] = \tilde{O}\left( H^{3}d^{2}|\mathcal{A}|T^{3/4}\right).
    \]
\end{restatable}

Our general strategy for improving the baseline regret given above is to show that there exists an episode after which \alg only selects good representations. Then, these good representations provide more efficient exploration, and we gain an improvement in learning efficiency. Hence, our regret bounds will only improve on the baseline regret result if we run the algorithm for long enough. Furthermore, establishing sub-linear regret without leveraging good representations is important for guaranteeing the selection of good representations at a later stage.

Nevertheless, to select good representations, we must ensure their existence. In that spirit, we introduce representations that approximately represent the ground-truth transition operator over the support of the occupancy distribution induced by the optimal policy. 

\begin{definition}\label{def:alpha_approximate}($\alpha^{\star}$-Approximate Representation)
       A representation \((\phi, \mu)\in\Phi\times\Psi\), with induced model \(\mathcal{P}\), is \emph{$\alpha^{\star}$-approximate} at level $\alpha$ if for all \(h\in[H]\),
\[
    \mathbb{E}_{(s,a)\sim d_{\mathcal{P}^{\star},h}^{\pi^{\star}}}[\Vert\mathcal{P}_{h}(\cdot|s,a) - \mathcal{P}_{h}^{\star}(\cdot|s,a)\Vert_{\textnormal{TV}}] \leq \alpha.
\]
    
\end{definition}

\begin{remark}
    The set of \(\alpha^{\star}\)-approximate representations \(\Phi_{\alpha}\times\Psi_{\alpha}\subseteq\Phi\times\Psi\) is non-empty for any \(\alpha\geq0\) whenever Assumption \ref{ass:realizability} (realizability) holds, since the ground-truth representation then lies in \(\Phi\times\Psi\) and has zero model error.
\end{remark}

Interestingly, we can show that the optimization oracle (Definition \ref{def:opt_oracle}) converges uniformly over the occupancy distribution of the optimal policy (Lemma \ref{lemma:alpha_star_selection}), provided that the distribution is well-defined, that is, Assumption \ref{ass:min_optimal_occupancy_exists} (minimal optimal occupancy) holds. The following assumption exploits this convergence and ensures that we are guaranteed to find a good representation. In Section \ref{sec:more_on_good_representations} we elaborate on how reasonable this assumption is. 

\begin{assumption}\label{ass:expressivness}($\alpha^{\star}$-Expressive Function Space)
    For all $\alpha^{\star}$-approximate representations \((\phi, \mu)\in\Phi_{\alpha}\times\Psi_{\alpha}\), there exists a representation \((\tilde{\phi}, \tilde{\mu})\in\Phi\times\Psi\) that is non-redundant and UniSOFT, such that the induced models \(\mathcal{P}\) and \(\tilde{\mathcal{P}}\) agree on all \((s,a)\in\mathcal{S}\times\mathcal{A}\).
\end{assumption}

We can show that the UniSOFT loss in Equation \ref{eq:unisoft_loss} eventually eliminates all redundant and all non-UniSOFT feature maps (Lemma \ref{lemma:UniSOFT_selection_full_rank}). Intuitively, if the exploration probabilities $\xi_{t}$ are decreasing and the regret of the behavior policies is sub-linear, the collected transitions will eventually be drawn mostly from the optimal occupancy distribution. Then only good features, which are guaranteed to exist by the expressiveness assumption above, minimize the UniSOFT loss.

Whenever the function space already consists of representations that have low model error on the optimal occupancy distribution, we can provide a purely gap-dependent regret bound.

\begin{restatable}[Gap-dependent regret with UniSOFT]{theorem}{instancedependentregretwithunisoft}\label{thm:instance_dependent_regret_bound_with_unisoft}
       Let $\xi_{t}=t^{-1/3}$ and \(\alpha=1\). Suppose Assumptions \ref{ass:realizability} (realizability), \ref{ass:sub_optimality_gap_exists} (minimal sub-optimality gap),
       \ref{ass:expressivness} ($\alpha^{\star}$-expressive function space) and \ref{ass:unique_optimal_policy} (unique optimal policy) hold. Then, for any \(T\in\mathbb{N}\), there exists a constant $\tau_{\textnormal{good}}$ such that \alg (Algorithm \ref{alg:UniSREP}) satisfies
    \begin{align*}
    \mathbb{E}[\tilde{\mathcal{R}}(T)] &= \tilde{O}( H^{3}d^{2}|\mathcal{A}|(\tau_{\textnormal{good}}\wedge T)^{5/6} \\
    &\qquad + \frac{1}{\lambda_{\textnormal{max}}^{\star}}H^{4}d|\mathcal{A}|^{1/2}T^{2/3})
    \end{align*}
    where 
    \( \tau_{\textnormal{good}}=\tilde{O}\left(\frac{H^{12}d^{12}|\mathcal{A}|^{6}}{(\Delta_{\textnormal{min}}\lambda_{\textnormal{max}}^{\star})^{6}}\right)
    \) and \(\lambda_{\textnormal{max}}^{\star} = \min_{\alpha}\max_{\phi\in\Phi_{\alpha}}\lambda^{\star}(\phi)\).
\end{restatable}

At a high level, \(\tau_{\textnormal{good}}\) captures the number of episodes \alg needs to eliminate all non-good representations. Hence, the theorem tells us that after \(\tau_{\textnormal{good}}\) ``warm-up'' episodes, during which we incur expected regret according to the parameter-adjusted baseline result (Lemma \ref{lemma:sublinear_expected_regret_without_unisoft}), we gain an increase in learning efficiency provided by the properties of good representations. The duration of the warm-up and the gain in learning efficiency depend on the ``goodness'' of the available representations, captured by $\lambda^{\star}$. Notably, the second term has a worse dependence on the horizon ($H^{4}$ instead of $H^{3}$).

If we additionally assume that the minimal optimal occupancy is well-defined (Assumption \ref{ass:min_optimal_occupancy_exists}), we can show that the behavior policies are eventually optimal. In particular, we show that the bonus term serves as an almost optimistic estimate of expected sub-optimality gaps (Lemma \ref{lemma:local_optimism}). Hence, if we are guaranteed to select good representations in each iteration (Lemma \ref{lemma:UniSOFT_selection_full_rank}), the bonus term decreases uniformly over the state-action space, leading to optimal behavior. However, since we bound sub-optimality gaps in expectation, we require $d_{\textnormal{min}}^{\star}$ to be well-defined, in order to determine the optimality of any policy (Lemma \ref{lemma:identify_optimal_policy}). We get the following improved result.

\begin{restatable}[Expected regret with UniSOFT]{theorem}{instancedependentregretwithunisoftandconstantpseudoregret}\label{thm:instance_dependent_regret_bound_with_unisoft_and_constant_pseudo_regret}
   Let \(\alpha>0\), \(\gamma\in(2,4]\) and \(\xi_{t}=t^{-1/\gamma}\). Suppose Assumptions \ref{ass:realizability} (realizability), \ref{ass:unique_optimal_policy} (unique optimal policy), \ref{ass:sub_optimality_gap_exists} (minimal sub-optimality gap), \ref{ass:min_optimal_occupancy_exists} (minimal optimal occupancy) and \ref{ass:expressivness} ($\alpha^{\star}$-expressive function space) hold. Then, for any \(T\in\mathbb{N}\), there exists a constant $\tau^{\star}$ such that \alg (Algorithm \ref{alg:UniSREP}) satisfies
    \begin{align*}
    \mathbb{E}[\tilde{\mathcal{R}}(T)] =\tilde{O}\left( H^{3}d^{2}|\mathcal{A}|(\tau^{\star}\wedge T)^{1/2 + 1/\gamma} + HT^{\frac{\gamma-1}{\gamma}}\right),
\end{align*}
where \(\tau^{\star} = \tilde{O}\left( \left(\frac{H^{2}d^{2}|\mathcal{A}|}{\alpha\lambda_{\textnormal{max}}^{\star}(\Delta_{\textnormal{min}}d_{\textnormal{min}}^{\star})^{2}}\right)^{\frac{2\gamma}{\gamma - 2}}\right).\)
\end{restatable}
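
As a sanity check on the exponents (a direct substitution into the bound above, not an additional result), consider two admissible values of \(\gamma\):
\begin{align*}
    \gamma=4:\quad & \tfrac{1}{2}+\tfrac{1}{\gamma}=\tfrac{3}{4}, \qquad \tfrac{\gamma-1}{\gamma}=\tfrac{3}{4},\\
    \gamma=3:\quad & \tfrac{1}{2}+\tfrac{1}{\gamma}=\tfrac{5}{6}, \qquad \tfrac{\gamma-1}{\gamma}=\tfrac{2}{3}.
\end{align*}
The choice \(\gamma=4\) reproduces the \(T^{3/4}\) rate of Lemma \ref{lemma:sublinear_expected_regret_without_unisoft}, and \(\gamma=3\) reproduces the exponents \(5/6\) and \(2/3\) of Theorem \ref{thm:instance_dependent_regret_bound_with_unisoft}. As \(\gamma\downarrow2\), the second term approaches \(HT^{1/2}\), but the exponent \(\frac{2\gamma}{\gamma-2}\) in \(\tau^{\star}\) diverges, so a faster asymptotic rate is paid for with a longer warm-up.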

In contrast to $\tau_{\textnormal{good}}$, \(\tau^{\star}\) additionally captures the number of episodes \alg needs to fully explore the feature space and subsequently identify the optimal policy. However, our algorithm still explores uniformly with positive probability, preventing constant regret. We also incur a dependence on $\alpha$ and $d_{\textnormal{min}}^{\star}$, capturing the difficulty of selecting $\alpha^{\star}$-approximate representations.
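
One heuristic way to see why the residual uniform exploration rules out constant regret, and where a term of order \(T^{\frac{\gamma-1}{\gamma}}\) can come from, is to count the expected number of failed Bernoulli experiments. The following calculation is only a sketch: it treats each episode as a single Bernoulli experiment and assumes (for illustration) that each failure contributes at most \(H\) to the regret; it is not the argument used in the proof. For \(\xi_{t}=t^{-1/\gamma}\) with \(\gamma>1\),
\[
    \sum_{t=1}^{T}\xi_{t}
    \leq 1+\int_{1}^{T}x^{-1/\gamma}\,\mathrm{d}x
    = 1+\frac{\gamma}{\gamma-1}\left(T^{\frac{\gamma-1}{\gamma}}-1\right)
    \leq \frac{\gamma}{\gamma-1}\,T^{\frac{\gamma-1}{\gamma}}.
\]
Since the sum is also bounded from below by \(\int_{1}^{T+1}x^{-1/\gamma}\,\mathrm{d}x\), it grows at the rate \(T^{\frac{\gamma-1}{\gamma}}\), which after multiplication by \(H\) has the same order as the second term in Theorem \ref{thm:instance_dependent_regret_bound_with_unisoft_and_constant_pseudo_regret}.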

Interestingly, if we assume that the quantities \(\Delta_{\textnormal{min}}\) and \(d_{\textnormal{min}}^{\star}\) are known\footnote{Extensions to lower bounds on $\Delta_{\textnormal{min}}$ and $d_{\textnormal{min}}^{\star}$ are straightforward.}, we can design a termination criterion that stops the algorithm whenever the behavior policy is optimal. \alg+ extends \alg by an evaluation phase (Lines \ref{alg:eval_start}-\ref{alg:eval_end}) in which we measure the uncertainty in the learned model through the value of the bonus term. If this uncertainty is below \(\Delta_{\textnormal{min}}d_{\textnormal{min}}^{\star}\), we stop the algorithm and return a policy that is optimal with high probability.

\begin{restatable}[Constant Regret]{theorem}{optimalpolicyidentification}\label{thm:optimal_policy_identification}
   Let \(\alpha>0\), $\delta\in(0,1)$ and \(\xi_{t}=t^{-1/4}\). Suppose that the quantities \(\Delta_{\textnormal{min}}\) and \(d_{\textnormal{min}}^{\star}\) are known. Then, under the same assumptions as in Theorem \ref{thm:instance_dependent_regret_bound_with_unisoft_and_constant_pseudo_regret}, with probability at least \(1-2\delta\), \alg+ (Algorithm \ref{alg:UniSREP}) satisfies the following:
   \[
   \tilde{\mathcal{R}}(T) \leq T\wedge\tau^{\star},\] 
   where\footnote{$\tilde{O}$ hides a constant of order $2^{64}$.} \(\tau^{\star} = \tilde{O}\left(\frac{H^{8}d^{8}|\mathcal{A}|^{4}}{(\alpha\lambda_{\textnormal{max}}^{\star})^{4}(\Delta_{\textnormal{min}}d_{\textnormal{min}}^{\star})^{8}}\right).
   \)
\end{restatable}
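
As a consistency check (a direct substitution, not an additional result), setting \(\gamma=4\), that is \(\xi_{t}=t^{-1/4}\), in the expression for \(\tau^{\star}\) from Theorem \ref{thm:instance_dependent_regret_bound_with_unisoft_and_constant_pseudo_regret} gives the exponent \(\frac{2\gamma}{\gamma-2}=4\) and therefore
\[
    \left(\frac{H^{2}d^{2}|\mathcal{A}|}{\alpha\lambda_{\textnormal{max}}^{\star}(\Delta_{\textnormal{min}}d_{\textnormal{min}}^{\star})^{2}}\right)^{4}
    = \frac{H^{8}d^{8}|\mathcal{A}|^{4}}{(\alpha\lambda_{\textnormal{max}}^{\star})^{4}(\Delta_{\textnormal{min}}d_{\textnormal{min}}^{\star})^{8}},
\]
which coincides with the \(\tau^{\star}\) stated in Theorem \ref{thm:optimal_policy_identification}.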


\subsection{Technical Challenges}


The main technical challenge in providing instance-dependent regret bounds lies in controlling the expected sub-optimality gaps. In Lemma \ref{lemma:expected_suboptimalitygap_to_bonus}, we demonstrate that the expected gaps can be controlled in terms of the value of the bonus under policy $\pi^{b}$. Unfortunately, this is not the policy that interacts with the environment, and hence the elliptical potential lemma cannot be applied directly. Importantly, UniSOFT features uniformly decrease the confidence intervals, which allows us to proceed with our analysis. As such, the role of representation learning and, in particular, that of UniSOFT features is central to our instance-dependent bounds.