\paragraph{Notation} We denote the set of probability distributions on a measurable set $A$ by \(\Delta(A)\). Furthermore, let \(\mathcal{U}(A)\) represent the uniform distribution over a finite set $A$ and let $\textnormal{Ber}(p)$ denote the Bernoulli distribution with success probability $p\in[0,1]$. Additionally, $[N] \coloneqq \{1, \ldots, N\}$ for any positive integer $N$. Finally, $\lesssim$ denotes inequalities up to absolute constants and $\tilde{O}(\cdot)$ hides absolute constants and poly-logarithmic terms.

We consider a finite-horizon episodic MDP described by the tuple
    \(\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}^{\star}, r^{\star}, H, d_{1})\),
where 
    \(\mathcal{S}\)
is the state space,
    \(\mathcal{A}\)
is the finite action space,
    \(
    \mathcal{P}^{\star} = \{\mathcal{P}_{h}^{\star}\}_{h\in[H]}
    \)
where
    \( \mathcal{P}_{h}^{\star}\colon\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})
    \)
is the transition operator (unknown) at time step \(h\in[H]\),
    \(
    r^{\star} = \{r_{h}^{\star}\}_{h\in[H]}
    \)
where
    \( r_{h}^{\star}\colon \mathcal{S}\times\mathcal{A}\to[0, 1] \)
is the deterministic reward function (known) at time step \(h\in [H]\),
    \(
    d_{1}\in\Delta(\mathcal{S})
    \)
is the initial state distribution (known), and 
    \(H\)
is the episode length. We assume the reward function to be normalized, i.e., \(\sum_{h=1}^{H}\sup_{s,a}r_{h}^{\star}(s,a)\leq 1\).

The agent interacts with the MDP \(\mathcal{M}\) in episodes. In each episode \(t\in\mathbb{N}\), the agent starts in an initial state \(s_{1}\sim d_{1}\) and, at each time step \(h\in[H]\), observes the state \(s_{h}\), chooses an action \(a_{h}\in\mathcal{A}\), receives the reward \(r_{h}^{\star}(s_{h}, a_{h})\), and transitions to a new state \(s_{h+1}\sim\mathcal{P}_{h}^{\star}(\cdot|s_{h}, a_{h})\). The episode ends upon reaching time step \(H+1\).

By 
    \( \Pi = \{\pi = \{\pi_{h}\}_{h\in[H]} \mid \forall h\in[H]:\pi_{h}\colon \mathcal{S}\to\mathcal{A}\}\)
we denote the policy space in which the elements are (deterministic\footnote{We require that our planning procedure outputs (w.l.o.g.) a deterministic policy to ensure that the representation learning oracle converges (see Appendix \ref{sec:unisoft_selection}).}) decision rules that map states to actions for any time step $h$. We define the state value function 
    \(
    V_{\mathcal{P},r;h}^{\pi}(s) = \mathbb{E}[\sum_{i=h}^{H} r_{i}(s_{i}, a_{i})|s_{h}=s,\mathcal{P},\pi]
    \)
to represent the expected total reward of policy
    \( \pi \)
under
    \(\mathcal{P}\) and \(r\)
starting in state 
    \( s\)
at time step
    \(h\).
To simplify notation, we define the function 
    \( \mathcal{P}_{h}V_{\mathcal{P},r;h+1}^{\pi}(s, a) = \mathbb{E}_{s'\sim\mathcal{P}_{h}(\cdot|s, a)}[V_{\mathcal{P},r;h+1}^{\pi}(s')],
    \)
where \(\mathcal{P}_{h}\) should be viewed as an operator on functions \(f\colon \mathcal{S}\to\mathbb{R}\) with \(f\mapsto \mathcal{P}_{h}f\).

We define the Q-function as 
    \(
    Q_{\mathcal{P},r;h}^{\pi}(s,a) = r_{h}(s,a) + \mathcal{P}_{h}V_{\mathcal{P},r;h+1}^{\pi}(s,a)
    \) and let 
    \( V_{\mathcal{P},r;1}^{\pi, d_{1}} = \mathbb{E}_{s\sim d_{1}}[V_{\mathcal{P},r;1}^{\pi}(s)] \), given some initial state distribution
    \(d_{1}\).
 The state-action occupancy distribution
    \( d_{\mathcal{P};h}^{\pi}(s,a)\) denotes the probability of visiting state \(s\) at time step \(h\) and performing action \(a\) in model \(\mathcal{P}\) with policy \(\pi\). By abuse of notation, let 
    \( d_{\mathcal{P};h}^{\pi}(s) = \sum_{a\in\mathcal{A}} d_{\mathcal{P};h}^{\pi}(s,a) \)
denote the state-occupancy distribution at time step \(h\). We can sample a state \(s\) from 
    \(d_{\mathcal{P};h}^{\pi}\)
by executing 
    \(\pi\)
for \(h-1\) steps starting from state
    \(s_{1}\sim d_{1}\).
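
As a brief sketch of how these objects fit together (assuming the standard convention \(V_{\mathcal{P},r;H+1}^{\pi}\equiv 0\), which we adopt here only for illustration), the value function of a deterministic policy \(\pi\in\Pi\) satisfies the backward recursion
\[
    V_{\mathcal{P},r;h}^{\pi}(s) = Q_{\mathcal{P},r;h}^{\pi}(s,\pi_{h}(s)) = r_{h}(s,\pi_{h}(s)) + \mathcal{P}_{h}V_{\mathcal{P},r;h+1}^{\pi}(s,\pi_{h}(s)), \qquad h = H, \ldots, 1,
\]
while the expected total reward can equivalently be written in terms of the occupancy distributions as
\[
    V_{\mathcal{P},r;1}^{\pi, d_{1}} = \sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d_{\mathcal{P};h}^{\pi}}[r_{h}(s,a)].
\]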

The agent's goal is to learn an optimal policy \( \pi^{\star} \in \arg\max_{\pi\in\Pi}V_{\mathcal{P}^{\star},r^{\star};1}^{\pi, d_{1}} \), which maximizes the expected total reward under \(\mathcal{P}^{\star}\), \(r^{\star}\) and \(d_{1}\). We evaluate the efficiency of an agent by the (expected) regret
\begin{align}\label{eq:regret}
    \mathbb{E}[\mathcal{R}(T)] = \mathbb{E}\Big[\sum_{t=1}^{T} \big(V_{\mathcal{P}^{\star}, r^{\star};1}^{\pi^{\star}, d_{1}} - V_{\mathcal{P}^{\star},r^{\star};1}^{\pi_{t}, d_{1}}\big)\Big],
\end{align}
which measures the expected cumulative performance loss up to episode \(T\in\mathbb{N}\), where \(\pi_{t}\) denotes the policy played by the agent in episode \(t\). Note that the expectation in Equation~\eqref{eq:regret} is taken w.r.t.\ any extra randomness induced by the algorithm. 
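
For orientation, and assuming that every policy played by the agent belongs to \(\Pi\), the reward normalization implies that each per-episode loss is bounded. Indeed, for any \(\pi\in\Pi\),
\[
    0 \;\leq\; V_{\mathcal{P}^{\star},r^{\star};1}^{\pi^{\star}, d_{1}} - V_{\mathcal{P}^{\star},r^{\star};1}^{\pi, d_{1}} \;\leq\; \sum_{h=1}^{H}\sup_{s,a}r_{h}^{\star}(s,a) \;\leq\; 1,
\]
where the first inequality uses the optimality of \(\pi^{\star}\) and the second uses that rewards are non-negative. Consequently, \(\mathbb{E}[\mathcal{R}(T)]\leq T\) holds trivially, and a regret bound is only informative if it grows sublinearly in \(T\).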

Finally, we denote the sub-optimality gap of taking action \(a\) in state \(s\) at time step \(h\) as
\(
    \Delta_{h}(s,a) = V_{\mathcal{P}^{\star},r^{\star};h}^{\pi^{\star}}(s) - Q_{\mathcal{P}^{\star},r^{\star};h}^{\pi^{\star}}(s,a),
\)
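which measures the loss in value incurred by taking action $a$ in state $s$ at time step $h$ and following \(\pi^{\star}\) thereafter, compared to following \(\pi^{\star}\) throughout.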
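
The gaps are directly linked to the per-episode loss. As a sketch of this standard identity (using the shorthand \(V_{h}^{\pi} = V_{\mathcal{P}^{\star},r^{\star};h}^{\pi}\) and \(Q_{h}^{\pi} = Q_{\mathcal{P}^{\star},r^{\star};h}^{\pi}\), and assuming the convention \(V_{H+1}^{\pi}\equiv 0\), both introduced only for this illustration), adding and subtracting \(Q_{h}^{\pi^{\star}}(s,\pi_{h}(s))\) gives, for any \(\pi\in\Pi\), \(h\in[H]\) and \(s\in\mathcal{S}\),
\begin{align*}
    V_{h}^{\pi^{\star}}(s) - V_{h}^{\pi}(s)
    &= \Delta_{h}(s,\pi_{h}(s)) + Q_{h}^{\pi^{\star}}(s,\pi_{h}(s)) - Q_{h}^{\pi}(s,\pi_{h}(s)) \\
    &= \Delta_{h}(s,\pi_{h}(s)) + \mathcal{P}_{h}^{\star}\big(V_{h+1}^{\pi^{\star}} - V_{h+1}^{\pi}\big)(s,\pi_{h}(s)).
\end{align*}
Unrolling this recursion over \(h\) and taking the expectation over \(s_{1}\sim d_{1}\) yields
\[
    V_{\mathcal{P}^{\star},r^{\star};1}^{\pi^{\star}, d_{1}} - V_{\mathcal{P}^{\star},r^{\star};1}^{\pi, d_{1}} = \sum_{h=1}^{H}\mathbb{E}_{(s,a)\sim d_{\mathcal{P}^{\star};h}^{\pi}}[\Delta_{h}(s,a)],
\]
that is, the loss of \(\pi\) equals the expected sum of gaps along its own trajectory. Note also that \(\Delta_{h}(s,\pi_{h}^{\star}(s)) = 0\) for every \(s\in\mathcal{S}\), and that \(\Delta_{h}(s,a)\geq 0\) whenever \(\pi^{\star}\) is optimal from state \(s\) at time step \(h\).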

\subsection{Structural Assumptions}

In this work, we are interested in MDPs with large, possibly infinite state spaces and hence require some form of structural assumptions such that efficient learning is possible. In particular, we assume that \( \mathcal{P}^{\star} \) admits a low-rank decomposition.

\begin{definition}\label{def:low-rank_mdp}(Low-rank MDP \citep{agarwal2020flambe})
    An MDP
        \( \mathcal{M} \) 
    is \emph{low-rank} or equivalently has \emph{low-rank structure} with rank 
        \( d\in\mathbb{N} \)
    if for every 
        \(h\in[H]\)
    there exist two embedding functions
        \( \phi_{h}^{\star}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d}
        \)
    and
        \( \mu_{h}^{\star}:\mathcal{S}\to\mathbb{R}^{d} \)
    such that
    \[
        \forall (s, a, s') \in\mathcal{S}\times\mathcal{A}\times\mathcal{S}\colon \mathcal{P}_{h}^{\star}(s'|s,a) = \langle\phi_{h}^{\star}(s,a), \mu_{h}^{\star}(s')\rangle,
    \]
    where, for normalization, 
        \( \Vert\phi_{h}^{\star}(s,a)\Vert_{2} \leq 1 \)
    for all
        \((s, a)\in\mathcal{S}\times\mathcal{A}\)
    and 
        \( \Vert\int_{\mathcal{S}}\mu_{h}^{\star}(s)g(s)ds\Vert_{2} \leq \sqrt{d}\Vert g\Vert_{\infty} \)
    for any function
        \( g\colon \mathcal{S}\to\mathbb{R} \).
\end{definition}
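
One consequence of Definition \ref{def:low-rank_mdp} worth spelling out is that expected next-step values are linear in \(\phi_{h}^{\star}\). As a short sketch (the symbol \(\theta_{h,f}\) is shorthand introduced only for this illustration), for any function \(f\colon\mathcal{S}\to\mathbb{R}\) with \(\Vert f\Vert_{\infty}<\infty\) we have
\[
    \mathcal{P}_{h}^{\star}f(s,a) = \int_{\mathcal{S}}\langle\phi_{h}^{\star}(s,a),\mu_{h}^{\star}(s')\rangle f(s')ds' = \langle\phi_{h}^{\star}(s,a),\theta_{h,f}\rangle,
    \qquad
    \theta_{h,f} := \int_{\mathcal{S}}\mu_{h}^{\star}(s')f(s')ds',
\]
where \(\Vert\theta_{h,f}\Vert_{2}\leq\sqrt{d}\Vert f\Vert_{\infty}\) by the normalization condition. In particular, choosing \(f = V_{\mathcal{P}^{\star},r^{\star};h+1}^{\pi}\) shows that \(Q_{\mathcal{P}^{\star},r^{\star};h}^{\pi}(s,a) - r_{h}^{\star}(s,a)\) is a linear function of \(\phi_{h}^{\star}(s,a)\) for every policy \(\pi\).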

As the embedding functions \(\phi_{h}^{\star}\) and \(\mu_{h}^{\star}\) are assumed to be unknown, we consider the \emph{representation learning problem} of finding good representations for state-action pairs and states over (known) finite function spaces 
    \( \Phi = \Phi_{1}\times\cdots\times\Phi_{H} \)
and 
    \( \Psi = \Psi_{1}\times\cdots\times\Psi_{H} \)
where, for each \(h\in[H]\),
    \(
    \Phi_{h}\subseteq \{\phi_{h}\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d}\}
    \)
and 
    \(
    \Psi_{h}\subseteq \{\mu_{h}\colon \mathcal{S}\to\mathbb{R}^{d}\}
    \).
For notational brevity, we denote \( \phi^{\star}=\{\phi_{h}^{\star}\}_{h\in[H]}\) and \( \mu^{\star}=\{\mu_{h}^{\star}\}_{h\in[H]}\). To ensure tractability of this representation learning problem, we assume realizability of the function spaces
\citep{agarwal2020flambe, uehara2021representation, modi2024model}.

\begin{assumption}\label{ass:realizability}(Realizability)
    For all
        $
        (s, a, h) \in\mathcal{S}\times\mathcal{A}\times[H]
        $,
    any
        $
        (\phi_{h},\mu_{h})\in\Phi_{h}\times\Psi_{h}
        $,
    and any function
        \( g\colon \mathcal{S}\to\mathbb{R} \),
    we have that
        \(
        \Vert\phi_{h}(s,a)\Vert_{2} \leq 1
        \),
        \(
        \Vert\int_{\mathcal{S}}\mu_{h}(s')g(s')ds'\Vert_{2} \leq \sqrt{d}\Vert g\Vert_{\infty}
        \),
    and 
        \( \int_{\mathcal{S}}\langle\phi_{h}(s,a), \mu_{h}(s')\rangle ds' = 1 \).
    Additionally, there exist (unknown) non-empty subsets
        \( \Phi^{\star} \subseteq \Phi \)
    and 
        \( \Psi^{\star} \subseteq \Psi \)
    such that any 
    \(
        (\phi^{\star}, \mu^{\star}) \in \Phi^{\star}\times\Psi^{\star}
    \)
     satisfies the low-rank decomposition of Definition~\ref{def:low-rank_mdp}.
\end{assumption}


Note that any tuple 
        \(
        (\phi, \mu) \in \Phi\times\Psi
        \)
    naturally induces a distribution over the state space in each time step and, in particular, a transition operator
    \(
        \mathcal{P} \equiv \langle\phi,\mu\rangle
    \).
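
As a simple illustration (a standard example, not specific to this work), suppose that \(\mathcal{S}\) and \(\mathcal{A}\) are finite, so that integrals over \(\mathcal{S}\) reduce to sums, and let \(\mathcal{P}\) be an arbitrary transition operator. Take \(d = |\mathcal{S}||\mathcal{A}|\), let \(\phi_{h}(s,a) = e_{(s,a)}\in\mathbb{R}^{d}\) be the standard basis vector indexed by \((s,a)\), and let \(\mu_{h}(s') = (\mathcal{P}_{h}(s'|\tilde{s},\tilde{a}))_{(\tilde{s},\tilde{a})\in\mathcal{S}\times\mathcal{A}}\in\mathbb{R}^{d}\). Then
\[
    \langle\phi_{h}(s,a),\mu_{h}(s')\rangle = \mathcal{P}_{h}(s'|s,a),
    \qquad
    \sum_{s'\in\mathcal{S}}\langle\phi_{h}(s,a),\mu_{h}(s')\rangle = 1,
\]
and the normalization conditions hold, since \(\Vert\phi_{h}(s,a)\Vert_{2} = 1\) and
\[
    \Big\Vert\sum_{s'\in\mathcal{S}}\mu_{h}(s')g(s')\Big\Vert_{2}
    = \Big(\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\big(\mathbb{E}_{s'\sim\mathcal{P}_{h}(\cdot|s,a)}[g(s')]\big)^{2}\Big)^{1/2}
    \leq \sqrt{d}\,\Vert g\Vert_{\infty}.
\]
Hence every tabular MDP is low-rank with rank at most \(|\mathcal{S}||\mathcal{A}|\), although this particular representation offers no compression.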
    
\subsection{Good Representations and Instance-Dependent Properties}

For clarity, the main results are presented under the assumption of a unique optimal policy. In Section~\ref{sec:mult_pol}, we show how this assumption can be dropped. Let \(\Pi^{\star}\) denote the set of all optimal (deterministic) policies. 

\begin{assumption}\label{ass:unique_optimal_policy}(Unique optimal policy)
There exists a unique optimal (deterministic) policy; that is, \(|\Pi^{\star}|=1\).
\end{assumption}

We call a feature mapping \(\phi\in\Phi\) 
\emph{good} if it maps the set of state-action pairs reachable by the optimal policy to a set of vectors that span the whole feature space. In particular, good representations are non-redundant and UniSOFT.

\begin{definition}\label{def:unisoft}(UniSOFT Representation \citep{papini2021reinforcement})
    A feature mapping
        \( \phi\in\Phi\)
    is called UniSOFT (Universally Spanning Optimal FeaTures) if for all \(h\in[H]\),
    \begin{align*}
        &\textnormal{span}\{\phi_{h}(s,a) \mid (s,a)\in\mathcal{S}\times\mathcal{A},\ \exists \pi\in\Pi\colon d_{\mathcal{P}^{\star};h}^{\pi}(s,a)>0\} \\
        &=
        \textnormal{span}\{\phi_{h}(s,\pi_{h}^{\star}(s)) \mid s\in\mathcal{S},\ d_{\mathcal{P}^{\star};h}^{\pi^{\star}}(s)>0\}
    \end{align*}
    holds. Moreover, a UniSOFT feature mapping \(\phi\) is \emph{non-redundant} if \(\lambda^{\star}(\phi) > 0\) holds, where
    \begin{align*}
        \lambda^{\star}(\phi):=\min_{h\in[H]} \lambda_{\textnormal{min}}(\mathbb{E}_{(s,a)\sim d_{\mathcal{P}^{\star};h}^{\pi^{\star}}}[\phi_{h}(s,a)\phi_{h}(s,a)^{\top}])
    \end{align*}
    and \(\lambda_{\textnormal{min}}(\cdot)\) returns the minimal eigenvalue.
\end{definition}

Intuitively, non-redundant UniSOFT features allow an algorithm to efficiently explore the whole feature space by behaving optimally in the environment. How efficiently the feature space can be explored depends on \(\lambda^{\star}(\cdot)\), which plays a major role in the regret bounds provided in the next chapter. Furthermore, we say that a transition operator \(\mathcal{P}\) \emph{admits} a non-redundant UniSOFT representation whenever there exists a representation \(\langle\phi,\mu\rangle\equiv\mathcal{P}\) such that \(\phi\) is UniSOFT and non-redundant.
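
To make the role of \(\lambda^{\star}\) concrete, fix \(h\in[H]\) and write \(\Sigma_{h}(\phi) := \mathbb{E}_{(s,a)\sim d_{\mathcal{P}^{\star};h}^{\pi^{\star}}}[\phi_{h}(s,a)\phi_{h}(s,a)^{\top}]\) (a shorthand used only in this illustration). For any \(v\in\mathbb{R}^{d}\),
\[
    v^{\top}\Sigma_{h}(\phi)v = \mathbb{E}_{(s,a)\sim d_{\mathcal{P}^{\star};h}^{\pi^{\star}}}[\langle\phi_{h}(s,a),v\rangle^{2}] \geq 0,
    \qquad\text{so that}\qquad
    \lambda_{\textnormal{min}}(\Sigma_{h}(\phi)) = \min_{\Vert v\Vert_{2}=1}\mathbb{E}_{(s,a)\sim d_{\mathcal{P}^{\star};h}^{\pi^{\star}}}[\langle\phi_{h}(s,a),v\rangle^{2}].
\]
Hence \(\lambda^{\star}(\phi)>0\) if and only if, for every \(h\in[H]\), there is no direction \(v\neq 0\) that is orthogonal to \(\phi_{h}(s,a)\) almost surely under \(d_{\mathcal{P}^{\star};h}^{\pi^{\star}}\). In other words, the features visited by the optimal policy excite every direction of \(\mathbb{R}^{d}\), and \(\lambda^{\star}(\phi)\) quantifies the weakest such excitation.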

We introduce two additional assumptions that will allow us to take advantage of good representations and perform an instance-dependent regret analysis. A very natural measure of hardness is the minimal sub-optimality gap, which captures the difficulty in detecting sub-optimal actions.

\begin{assumption}\label{ass:sub_optimality_gap_exists}(Well-defined minimal sub-optimality gap)
The quantity
    \[
        \Delta_{\textnormal{min}} := \min_{s\in\mathcal{S}, a\in\mathcal{A}, h\in[H]: \Delta_{h}(s,a)>0} \Delta_{h}(s,a)
    \]    
    is well-defined.
\end{assumption}

Finally, we assume that the minimal optimal occupancy exists. Intuitively, this ensures that, when playing an optimal policy, every state-action pair reachable by this policy is visited with probability bounded away from zero. 

\begin{assumption}\label{ass:min_optimal_occupancy_exists}(Well-defined minimal optimal occupancy)
The quantity
    \[d_{\textnormal{min}}^{\star} := \min_{s\in\mathcal{S},a\in\mathcal{A},h\in[H],\pi^{\star}\in\Pi^{\star}: d_{\mathcal{P}^{\star};h}^{\pi^{\star}}(s,a) > 0 }d_{\mathcal{P}^{\star};h}^{\pi^{\star}}(s,a)\]
    is well-defined.
\end{assumption}

Note that both assumptions are trivially satisfied whenever $\mathcal{S}$ and $\mathcal{A}$ are finite.
