Title: COMPUTATIONALLY EFFICIENT RL UNDER LINEAR BELLMAN COMPLETENESS FOR DETERMINISTIC DYNAMICS

Abstract: We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman Complete setting. This setting uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDP) and Linear Quadratic Regulators (LQR). While prior work has shown that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards, but requires the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least squares regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to act only in the null space of the training data, which ensures optimism while circumventing a subtle error amplification issue.

Section: INTRODUCTION
Various application domains of Reinforcement Learning (RL), including game playing, robotics, self-driving cars, and foundation models, feature environments with large state and action spaces. In such settings, the learner aims to find a well-performing policy by repeated interactions with the environment to acquire knowledge. Due to the high dimensionality of the problem, function approximation techniques are used to generalize the knowledge acquired across the state and action space. Under the broad category of function approximation, model-free RL stands out as a particularly popular approach due to its simple implementation and relatively better sample efficiency in practice. In model-free RL, the learner uses function approximation (e.g., an expressive function class like deep neural networks) to model the state-action value function of various policies in the underlying MDP. In fact, the combination of model-free RL with various empirical exploration heuristics has led to notable empirical advances, including breakthroughs in game playing (Silver et al., 2016; Berner et al., 2019), robot manipulation (Andrychowicz et al., 2020), and self-driving (Chen et al., 2019).
Theoretical advancements have paralleled the practical successes in RL, with tremendous progress in recent years in building rigorous statistical foundations to understand what structures in the environment and the function class suffice for sample-efficient RL. These advancements are supported by optimal exploration strategies that align with the corresponding structural assumptions, and by now we have a rich set of tools and techniques for sample-efficient RL in MDPs with large state/action spaces (Russo & Van Roy, 2013;Jiang et al., 2017;Sun et al., 2019;Wang et al., 2020;Du et al., 2021;Jin et al., 2021;Foster et al., 2021;Xie et al., 2022). However, despite a rigorous statistical foundation, a significant challenge remains: many of these theoretically rigorous approaches for rich function approximation are not computationally feasible, and thus have limited practical applicability. For example, some require solving complex optimization problems that are computationally intractable in practice (Zanette et al., 2020b); others require deterministic dynamics and initial states (Du et al., 2020); and some methods depend on maintaining large and complex version spaces (Jin et al., 2021;Du et al., 2021) which are intractable in terms of memory and computation.
One of the most striking examples of this statistical-computational gap is observed in the Linear Bellman Completeness setting, which is perhaps one of the simplest learning settings. Linear Bellman completeness serves as a bridge between the RL and control theory literature, as it provides a unified framework to capture Linear MDPs (Jin et al., 2020; Agarwal et al., 2019; Zanette et al., 2020b) and the Linear Quadratic Regulator (LQR), two popular models in RL and control respectively. In particular, the linear Bellman completeness setting captures MDPs where the state-action value function of the optimal policy is a linear function of some pre-specified feature representations (of states and actions), and the Bellman backups of linear state-action value functions are linear (w.r.t. some feature representation). Naturally, for this setting, the learner utilizes the function class F consisting of all linear functions over the given feature representation as the value function class for model-free RL. In addition to considering a linear class, we also assume that the class F exhibits low inherent Bellman error, a structural assumption that quantifies the error in approximating the Bellman backup of functions within F. The first assumption, i.e., linearity of the optimal state-action value function, is perhaps the simplest modeling assumption one can make in RL with function approximation. Furthermore, emerging evidence suggests that linearity is practically useful, as with adequate feature representation, linear functions can represent value functions in various domains. The second assumption, i.e., low inherent Bellman error of the class, while being a bit mysterious, is a natural condition for statistical tractability for classic algorithms such as Fitted Q-iteration (FQI) and temporal difference (TD) learning with linear function approximation (Munos, 2005; Zanette et al., 2020b).
It is also well-known that linearity alone does not suffice for efficient RL (Wang et al., 2021;Weisz et al., 2021).
While prior works have shown that RL with linear Bellman completeness is statistically tractable, and one can learn with sample complexity that scales polynomially with both d and H (where d is the dimensionality of the feature representation and H is the horizon of the RL problem), the proposed algorithms that obtain such sample complexity in the online RL setting are not computationally efficient. Given the simplicity of the problem, it was conjectured that a computationally efficient algorithm should exist. However, no such algorithms were proposed. Unfortunately, the classical approaches of combining supervised learning techniques with RL in the online setting, e.g., value function iteration, which are computationally efficient by design, fail to remain statistically tractable due to exponential blowups from error compounding, especially without making norm-boundedness assumptions. On the other hand, the techniques of adding quadratic exploration bonuses, e.g., the one proposed in LinUCB (Li et al., 2010) and used in LSVI for linear MDPs, also fail here as Bellman backups of quadratic functions are not necessarily within the linear class F. In fact, the search for a computationally efficient algorithm with large action spaces is open even when the transition dynamics are deterministic.
In this work, we provide the first computationally efficient algorithm for the linear Bellman complete setting with deterministic dynamics, which enjoys a regret bound of $\tilde{O}(d^{5/2} H^{5/2} + d^{2} H^{3/2} T^{1/2})$ for feature dimension d, horizon H, and number of rounds T. Importantly, our algorithm works with large action spaces, stochastic reward functions, and stochastic initial states. The key ideas of our algorithm are twofold: using randomization to encourage exploration and leveraging a span argument to bound the regret. While adding random noise to the learned parameters has been quite successful in linear function approximation, unfortunately, for our specific setting, since we need to add sufficiently large noise to cancel out the estimation error, blind randomization can cause the corresponding parameters to grow exponentially with the horizon. We avoid paying for this blow-up by only adding noise to the null space of the data. In particular, when the dynamics are deterministic, by adding exploration noise only in the null space, we can learn the value function exactly for any trajectories that lie within the span of the data seen so far. Additionally, a simple span argument bounds the number of times the trajectories fall outside the span of the historical data. Together, these techniques lead to our polynomial sample complexity bound. The resulting algorithm relies on linear regression oracles under convex constraints, which we show can be approximately solved via a random-walk-based algorithm (Bertsimas & Vempala, 2004).
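The span argument mentioned above can be illustrated with a minimal sketch. It assumes a generic stream of feature vectors (the random data and the helper name `count_out_of_span` are hypothetical, not from the paper) and counts how often a new feature leaves the span of the earlier ones; the count can never exceed the dimension d.

```python
import numpy as np

def count_out_of_span(feats, tol=1e-9):
    """Count how many rows of `feats` fall outside the span of the rows before them."""
    d = feats.shape[1]
    basis = np.zeros((0, d))              # orthonormal rows spanning the history so far
    count = 0
    for phi in feats:
        resid = phi - basis.T @ (basis @ phi)   # component of phi outside the span
        norm = np.linalg.norm(resid)
        if norm > tol:                    # phi adds a new direction
            count += 1
            basis = np.vstack([basis, resid / norm])
    return count

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 5))     # hypothetical stream of 200 features in R^5
print(count_out_of_span(feats))           # never exceeds the dimension 5
```

Each out-of-span event raises the rank of the history by one, which is why the number of such events is bounded by d at every step h regardless of how many rounds are played.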

Section: RELATED WORKS
Our work builds upon and differentiates itself from several lines of research in computationally efficient Reinforcement Learning (RL) and exploration strategies. We categorize the most relevant prior efforts below.

Computationally Efficient RL under Linear Bellman Completeness. The pursuit of computationally efficient RL algorithms within the linear Bellman completeness (LBC) setting has been a significant area of focus. For basic tabular MDPs, efficient and near-optimal algorithms are well-established (Azar et al., 2017; Zhang et al., 2020; Jin et al., 2018). These results extend to linear MDPs (Jin et al., 2020), where computationally efficient algorithms, such as LSVI-UCB, are also known (Jin et al., 2020; Agarwal et al., 2023; He et al., 2023). However, the broader LBC setting, which subsumes linear MDPs but lacks certain restrictive assumptions (e.g., on Bellman operator norm-boundedness or feature representation properties), presents a greater challenge. The existence of computationally efficient algorithms for general LBC has remained an open question. Previous works have often relied on strong simplifying assumptions to achieve computational efficiency, such as a limited number of actions (Golowich & Moitra, 2024) or "explorable" MDPs (Zanette et al., 2020c), which do not hold universally. Our approach specifically targets the LBC setting with deterministic dynamics but without these strong assumptions, addressing a previously unmet need. We provide a more detailed overview of the literature and its limitations in Section 3.2.

Exploration via Randomization. Random noise injection has emerged as a powerful alternative to bonus-based exploration in RL. Randomized Least-Squares Value Iteration (RLSVI) (Osband et al., 2016) is a prominent example, injecting Gaussian noise into least-squares estimates to achieve near-optimal worst-case regret for linear MDPs (Agrawal et al., 2021; Zanette et al., 2020a). Other randomization techniques include posterior sampling via Langevin Monte Carlo for Q-functions (Ishfaq et al., 2023) and methods for general function approximation with bounded eluder dimension and Bellman completeness (Ishfaq et al., 2021). Randomization is also explored in preference-based RL (Wu & Sun, 2024). However, a critical limitation of these prior randomization methods in the LBC setting is their tendency to cause exponential blow-up of parameter values. This occurs because the noise injected is often larger than the estimation error, and without strong norm-boundedness assumptions (which we do not make), the parameters can grow unboundedly with the horizon. While truncation is a common mitigation strategy, it is only feasible in low-rank MDPs and fundamentally incompatible with linear Bellman completeness, as the Bellman backup of a truncated value function is generally no longer linear. Our work introduces a novel null-space randomization technique that explicitly addresses this parameter blow-up issue, making randomization viable for LBC problems.

Beyond Linear Bellman Completeness. The LBC setting itself is a specific structural condition. Broader structural conditions like Bilinear classes (Du et al., 2021), Bellman eluder dimension (Jin et al., 2021), Bellman rank (Jiang et al., 2017), witness rank (Sun et al., 2019), and decision-estimation coefficients (Foster et al., 2021) have also been explored. While these conditions have led to statistically efficient algorithms, computationally efficient counterparts often remain unknown, further highlighting the general challenge of bridging the statistical-computational gap in RL with rich function approximation.

Section: PRELIMINARIES
A finite-horizon Markov Decision Process (MDP) is given by a tuple M = (S, A, H, T, r, µ) where S is the state space, A is the action space, H ∈ N is the horizon, T ∶ S × A → ∆(S) is the transition function, r ∶ S × A → [0, 1] is the reward function and µ ∈ ∆(S) is the initial state distribution. Given a policy π ∶ S ↦ ∆(A), we denote
$$Q^{\pi}_h(s, a) = \mathbb{E}^{\pi}\Big[\sum_{i=h}^{H} r_i \,\Big|\, s_h = s,\ a_h = a\Big]$$
as the layered state-action value function of policy $\pi$ and $V^{\pi}_h(s) = Q^{\pi}_h(s, \pi(s))$ as the state value function. The optimal value function is denoted by $V^{\star}_h(s) = \max_{\pi} V^{\pi}_h(s)$, and the optimal policy is $\pi^{\star}$. We focus on the setting of linear function approximation and consider the following linear Bellman completeness, which ensures that the Bellman backup of a linear function remains linear.
Definition 1 (Linear Bellman Completeness). An MDP is said to be linear Bellman complete with respect to a feature mapping $\phi$ if there exists a mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ so that, for all $\theta \in \mathbb{R}^d$ and all $(s, a) \in S \times A$, it holds that
$$\langle T\theta, \phi(s, a)\rangle = \mathbb{E}_{s' \sim T(s,a)} \max_{a'} \langle \theta, \phi(s', a')\rangle.$$
Moreover, we require that, for all $h \in [H]$ and $(s, a) \in S \times A$, the random reward is bounded in $[0, 1]$ with mean $r_h(s, a) = \langle \omega^{\star}_h, \phi(s, a)\rangle$ for some unknown $\omega^{\star}_h \in \mathbb{R}^d$.
We assume $\|\phi(s, a)\|_2 \le 1$ for all s ∈ S and a ∈ A. Notably, we do not impose any upper bound on $\|\omega^{\star}_h\|_2$ or any $\ell_2$-norm non-expansiveness of the Bellman backup, distinguishing our setting from some existing works. In Section 3.1, we discuss why many existing definitions of linear Bellman completeness fail to capture even tabular MDPs or linear MDPs due to certain $\ell_2$-norm boundedness assumptions.
We further assume the feature space spans R d , i.e., span({ϕ(s, a) ∶ s ∈ S, a ∈ A}) = R d ; otherwise, we can project the feature space onto its span or use pseudo-inverse in the analysis when needed. We can verify that the linear Bellman completeness captures both linear MDPs and Linear Quadratic Regulators (for a convex subset of linear functions). The proof is in Appendix E.
Next, we consider deterministic state transitions.
Assumption 1 (Deterministic transitions). For all s ∈ S and a ∈ A, there is a unique state s′ ∈ S to which the system transitions after taking action a in state s.
We emphasize that, although the transition is deterministic, the initial state distribution is stochastic (although we assume that {s t,1 } t≤T is independently sampled from an initial distribution µ, our results extend to the scenarios when {s t,1 } t≤T are adversarially chosen). Additionally, the reward signals can be stochastic. Hence, learning is still challenging in this case. The goal is to achieve low regret over T rounds. The regret is defined as
$$\mathrm{Reg}_T := \mathbb{E}\Big[\sum_{t=1}^{T} \big(V^{\star}(s_{t,1}) - V^{\pi_t}(s_{t,1})\big)\Big].$$
The expectation here is taken over the randomness of the algorithm and the reward signals. While the regret is defined in expectation for simplicity, a concentration inequality can yield a high-probability regret bound. In this paper, we use the asymptotic notations Θ(⋅) and Õ(⋅) to hide logarithmic and constant factors.

Section: OTHER LINEAR BELLMAN COMPLETENESS DEFINITIONS IN THE LITERATURE
Several closely related definitions of Linear Bellman Completeness have been considered in the literature. In the following, we demonstrate that some of these variant definitions face limitations due to additional ℓ 2 -norm assumptions. We present two commonly imposed assumptions in existing works below, and subsequently provide examples illustrating their potential limitations.
(1) Assuming Bounded $\ell_2$-norm of Parameters. Golowich & Moitra (2024); Zanette et al. (2020b;c) assume that any value function under consideration has its parameters bounded in $\ell_2$-norm, i.e., when we apply the Bellman backup, the resulting state-action value function always lies in $\{Q : Q(s, a) = \langle \phi(s, a), \theta\rangle,\ \|\theta\|_2 \le R\}$ where R is a pre-fixed polynomial in the dimension of the feature space. We will show that this assumption might not hold, since $\|\theta\|_2$ is not necessarily bounded under linear Bellman completeness.
(2) Assuming Non-expansiveness of Bellman Backup in $\ell_2$-norm. Song et al. (2022) assume that, after applying the Bellman backup, the $\ell_2$-norm of the value function parameters will not increase, i.e., for any $\theta$, they assume the existence of a parameter $\theta'$ such that $\|\theta'\|_2 \le \|\theta\|_2$ and $\langle \phi(s, a), \theta'\rangle = \mathbb{E}_{s' \sim T(s,a)} \max_{a'} \langle \phi(s', a'), \theta\rangle$ for all s, a. This assumption is stronger than the previous one and does not hold even in tabular MDPs, as we will show in the second example below.
The following example demonstrates that the two assumptions above do not generally hold under linear Bellman completeness, as the $\ell_2$-norm amplification can be arbitrarily large.
Example 1 (Arbitrarily Large $\ell_2$-norm on Parameters). Consider a layered linear MDP with three states, $s_1, s_2, s_3$, and a single action $a_1$. Here $s_1$ is in the first layer and $s_2$ and $s_3$ are in the second layer. For some $0 < \varepsilon < p < 1$, we define
$$\phi(s_1, a_1) = (\sqrt{\varepsilon}, \sqrt{p - \varepsilon}), \quad \mu(s_2) = (p/\sqrt{\varepsilon}, 0), \quad \mu(s_3) = (0, (1 - p)/\sqrt{p - \varepsilon}).$$
We further define $r(s_2, \cdot) = \varepsilon$ and $r(s_3, \cdot) = 1$. We can verify that $P(s_2 \mid s_1, a_1) = \langle \phi(s_1, a_1), \mu(s_2)\rangle = p$ and $P(s_3 \mid s_1, a_1) = \langle \phi(s_1, a_1), \mu(s_3)\rangle = 1 - p$. Hence $Q(s_1, a_1) = p\varepsilon + 1 - p$. We assume the Q-function is parameterized by $\theta$. Then, since $\|\phi(s_1, a_1)\|_2 = \sqrt{p}$, the Cauchy-Schwarz inequality gives $\|\theta\|_2 \ge (p\varepsilon + 1 - p)/\sqrt{p} = \sqrt{p}\,\varepsilon + p^{-1/2} - \sqrt{p}$. Since p can be arbitrarily small, the norm of $\theta$ can be arbitrarily large.
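As a numerical sanity check of Example 1, the sketch below rebuilds the features and measures for a few values of p and evaluates the Cauchy-Schwarz lower bound Q(s1, a1)/∥ϕ(s1, a1)∥₂ on ∥θ∥₂. The function name `example1` and the choice ε = p/2 are illustrative assumptions, not part of the paper.

```python
import numpy as np

def example1(p, eps):
    """Quantities of Example 1; requires 0 < eps < p < 1."""
    phi = np.array([np.sqrt(eps), np.sqrt(p - eps)])      # phi(s1, a1)
    mu_s2 = np.array([p / np.sqrt(eps), 0.0])             # mu(s2)
    mu_s3 = np.array([0.0, (1 - p) / np.sqrt(p - eps)])   # mu(s3)
    prob_s2 = phi @ mu_s2                 # P(s2 | s1, a1)
    prob_s3 = phi @ mu_s3                 # P(s3 | s1, a1)
    q = prob_s2 * eps + prob_s3 * 1.0     # Q(s1, a1) with r(s2) = eps, r(s3) = 1
    # Cauchy-Schwarz: Q = <theta, phi> <= ||theta||_2 * ||phi||_2
    theta_norm_lower = q / np.linalg.norm(phi)
    return prob_s2, prob_s3, q, theta_norm_lower

for p in (1e-1, 1e-2, 1e-4):
    print(p, example1(p, eps=p / 2))      # the last entry grows as p shrinks
```

The transition probabilities sum to one for every p, while the lower bound on the parameter norm increases without limit as p decreases.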
We may hope to "normalize" the features in this example so that the $\ell_2$-norm of the parameters is bounded. However, it is unclear how to do so since changing either ε or p will change the MDP, and feature search is likely a hard problem. Essentially, this example breaks one of the assumptions in the original linear MDP definition (Jin et al., 2020), which requires the integral ∫ gµ to be bounded for any function g ∈ [0, 1]. Thus, although this example is a linear MDP, the original LSVI-UCB algorithm (Jin et al., 2020) will not work for it. However, we note that our algorithm can still work.
Nevertheless, as the above example leverages a careful design of the feature, we might hope that some non-expansiveness properties could hold under stronger representation assumptions (e.g., when state space is tabular). Unfortunately, the following example shows that Bellman backup can be expansive even in tabular MDPs.
Example 2 (Expansiveness of Bellman Backup in $\ell_2$-norm). Consider a tabular MDP with horizon $H = 2$, $S$ states $\{s_1, \dots, s_S\}$ in the first layer, a single state $s$ in the second layer, and a single action $a$. Taking action $a$ in any first-layer state deterministically transitions the agent to $s$, and taking action $a$ in $s$ deterministically yields a reward of 1. Since linear Bellman completeness captures tabular MDPs with one-hot encoded features where $\phi(s_i, a) = e_i \in \mathbb{R}^{S+1}$ for $i \le S$ and $\phi(s, a) = e_{S+1} = (0, \dots, 0, 1)^{\top}$, the state-action value function at the second layer can be parameterized by $\theta_2 = (0, \dots, 0, 1)^{\top}$. However, applying the Bellman backup, since the return-to-go for any first-layer state $s_i$ is 1 (because $s$ always yields a reward of 1), the backed-up value function must be parameterized by $\theta_1 = (1, 1, \dots, 1)^{\top}$. Here, we find that $\|\theta_1\|_2 / \|\theta_2\|_2 = \sqrt{S}$, thus showing that the Bellman backup cannot guarantee non-expansiveness of the $\ell_2$-norm.
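The computation in Example 2 can be reproduced in a few lines. This is a minimal sketch under two assumptions that are not in the paper: the number of first-layer states is set to S = 9, and the coordinate of θ1 that belongs to the second-layer state, which no first-layer feature touches, is set to 0 (the smallest-norm choice, and the one that gives the ratio √S).

```python
import numpy as np

S = 9  # number of first-layer states (hypothetical choice)

# One-hot features in R^(S+1): index i < S is first-layer state s_i, index S is the second-layer state.
theta2 = np.zeros(S + 1)
theta2[S] = 1.0                      # second-layer Q-value: reward 1

# Bellman backup: every first-layer state moves deterministically to the
# second-layer state, so its backed-up value is <theta2, e_{S+1}> = 1.
next_value = theta2[S]
theta1 = np.zeros(S + 1)
theta1[:S] = next_value              # last coordinate left at 0 (assumption)

ratio = np.linalg.norm(theta1) / np.linalg.norm(theta2)
print(ratio, np.sqrt(S))             # the two values agree
```

The ratio grows with the number of states even though every value is bounded by 1, which is the expansiveness that the example points out.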
Hence, in this paper, we do not assume any $\ell_2$-norm bound or $\ell_2$-norm non-expansiveness of the parameters. Unfortunately, without these assumptions, the ground truth parameter of the optimal value function can grow exponentially with the horizon, as evidenced by the examples above, thus invalidating prior methods that require bounded parameters. Our key contribution is an algorithm that remains efficient even if the parameter norm blows up, although it requires deterministic transitions.

Section: OTHER PRIOR WORKS ON LINEAR BELLMAN COMPLETENESS
This section provides a comprehensive review of prior research on RL within the linear Bellman completeness framework, critically examining the assumptions and limitations of existing approaches. Our analysis highlights the specific challenges that our proposed algorithm overcomes.

Efficient Algorithms under Generative Access. Algorithms that assume access to a generative model, which can provide samples of the next state (s' ~ T(⋅ | s, a)) and reward signal for any given state-action pair (s, a), have achieved statistical and computational efficiency. Linear Least-Squares Value Iteration (LSVI) is a prime example within this category (Agarwal et al., 2019). However, the assumption of generative access is often unrealistic in practical online RL scenarios. Our work specifically focuses on the more challenging online access setting, where the agent interacts directly with the environment.

Efficient Algorithms under Explorability Assumption. Some prior works, such as Zanette et al. (2020c), propose reward-free algorithms contingent on an "explorability" assumption. This assumption posits that every direction in the parameter space is reachable, which, in tabular MDPs, implies that any state can be reached with a sufficiently high probability. This condition is restrictive and does not hold in environments with unreachable states or where reaching certain states has exponentially small probabilities. Our algorithm operates without such strong assumptions on environmental explorability.

Computationally Intractable Algorithms. A notable portion of the literature, including Zanette et al. (2020b), presents statistically efficient algorithms that are unfortunately computationally intractable, often requiring the solution of complex, non-convex optimization problems. A core design principle of our work is to rely solely on tractable squared loss minimization oracles, ensuring computational feasibility.

Few-Action MDPs. Golowich & Moitra (2024) introduced a computationally efficient algorithm for linear Bellman completeness, extending bonus-based exploration from LSVI-UCB (Jin et al., 2020) for Linear MDPs. While their method handles stochastic MDPs, both its sample complexity and running time exhibit an exponential dependence on the size of the action space. This makes it impractical for problems with large or continuous action spaces. In contrast, our algorithm is designed to scale efficiently to infinite action spaces, albeit under the assumption of deterministic transition dynamics.

Deterministic Rewards or Deterministic Initial State. Several studies have developed computationally and statistically efficient algorithms for more general settings by imposing strong assumptions on either the reward function or the initial state distribution. Du et al. (2020) present an efficient algorithm for the Linear Q⋆ setting with deterministic transitions, deterministic initial states, and stochastic rewards, leveraging a span argument. However, their approach cannot be directly extended to scenarios with stochastic initial states, which are explicitly considered in our paper. Another line of work by Wen & Van Roy (2017) addresses the Q⋆-realizable setting with deterministic dynamics, deterministic rewards, stochastic initial states, and bounded eluder dimension. While extendable to linear Bellman completeness when both rewards and dynamics are deterministic, their algorithm struggles with stochastic rewards, thus limiting its applicability to our problem setting.

Efficient Algorithms in Hybrid RL. Song et al. (2022) explore efficient algorithms in a hybrid RL setting, where the learner benefits from both online interaction and an existing offline dataset. While valuable, their work does not provide a fully online algorithm, which is the primary focus of our research.

In summary, despite significant prior work, a computationally efficient online RL algorithm for the linear Bellman complete setting that simultaneously accommodates stochastic initial states, stochastic rewards, and large (or infinite) action spaces, while only requiring deterministic transition dynamics, remained an open problem. This paper directly addresses and fills this critical gap.

Section: ALGORITHM
In this section, we present our algorithm for online RL under linear Bellman completeness. See Algorithm 1 for pseudocode. The input to the algorithm consists of three components. The first is the noise variances $\{\sigma_h\}_{h=1}^{H}$ and $\sigma_R$, which control the scale of the random noise. The second is a D-optimal design (defined below) for the feature space.
Definition 2 (D-optimal design). The D-optimal design for the set of features Φ = {ϕ(s, a) ∶ s ∈ S, a ∈ A} is a distribution ρ over Φ that maximizes $\log\det\big(\sum_{\phi \in \Phi} \rho(\phi)\, \phi\phi^{\top}\big)$.
There always exist D-optimal designs with at most $O(d^2)$ support points (Lemma 23). Many efficient algorithms, such as the Frank-Wolfe algorithm, can be applied to find approximate D-optimal designs. Third, the algorithm requires a constrained squared loss minimization oracle $O_{sq}$, and we introduce an instantiation of $O_{sq}$ in Section 6.
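To make the D-optimal design input concrete, here is a minimal sketch of Frank-Wolfe iterations for the log-determinant objective of Definition 2. It assumes a finite feature set stored as the rows of an array `Phi` that span R^d; the function name, the stopping tolerance, and the iteration cap are illustrative choices rather than anything specified in the paper.

```python
import numpy as np

def d_optimal_design(Phi, iters=5000, tol=1e-2):
    """Frank-Wolfe iterations for max_rho log det(sum_i rho_i phi_i phi_i^T).

    Phi: (n, d) array whose rows are feature vectors spanning R^d.
    Returns a probability vector rho over the n rows.
    """
    n, d = Phi.shape
    rho = np.full(n, 1.0 / n)                  # uniform start keeps V(rho) invertible
    for _ in range(iters):
        V = Phi.T @ (rho[:, None] * Phi)       # V(rho) = sum_i rho_i phi_i phi_i^T
        # g_i = phi_i^T V^{-1} phi_i is the gradient of log det with respect to rho_i
        g = np.einsum("ij,jk,ik->i", Phi, np.linalg.inv(V), Phi)
        j = int(np.argmax(g))
        if g[j] <= d * (1 + tol):              # Kiefer-Wolfowitz: max_i g_i = d at the optimum
            break
        step = (g[j] / d - 1.0) / (g[j] - 1.0) # exact line search toward vertex e_j
        rho = (1.0 - step) * rho
        rho[j] += step
    return rho

rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 3))             # hypothetical feature set
rho = d_optimal_design(Phi)
print(np.count_nonzero(rho > 1e-6), rho.sum())
```

The pairs (ϕ_i, ρ_i) with positive weight play the role of the design {(ϕ_i, ρ_i)} passed to Algorithm 1; a sparsification step would be needed to reach the O(d²) support size, which this sketch does not attempt.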

Section: Algorithm 1 Null Space Randomization for Linear Bellman Completeness
Require:
• Noise variances $\{\sigma_h\}_{h=1}^{H}$ and $\sigma_R$.
• A D-optimal design for $\Phi = \{\phi(s, a) : s \in S, a \in A\}$ given by $\{(\phi_i, \rho_i)\}_{i=1}^{m}$.
• Squared loss minimization oracle $O_{sq}$.
1: Define $\Sigma_{1,h} := \sum_{i=1}^{m} \rho_i \phi_i \phi_i^{\top}$ for all $h \in [H]$.
2: for $t = 1, \dots, T$ do
3:   Let $\theta_{t,H+1} \leftarrow 0$, $Q_{t,H+1} \leftarrow 0$, $V_{t,H+1} \leftarrow 0$.
4:   for $h = H, \dots, 1$ do
5:     Let $P_{t,h}$ be the orthogonal projection matrix onto $\mathrm{span}(\{\phi(s_{i,h}, a_{i,h}) : i = 1, \dots, t-1\})$.
6:     For $i \in [m]$, define $\phi^{\parallel}_{t,h,i} = P_{t,h}\phi_i$ and $\phi^{\perp}_{t,h,i} = (I - P_{t,h})\phi_i$.
7:     Let $\Lambda_{t,h} \leftarrow \sum_{i=1}^{m} \rho_i \big(\phi^{\parallel}_{t,h,i}(\phi^{\parallel}_{t,h,i})^{\top} + \phi^{\perp}_{t,h,i}(\phi^{\perp}_{t,h,i})^{\top}\big)$.
8:     // Fit value function and reward using squared loss regression //
9:     Compute $\hat{\theta}_{t,h}$ and $\hat{\omega}_{t,h}$ using the squared loss minimization oracle $O_{sq}$ as:
         $\hat{\theta}_{t,h} \leftarrow \operatorname{argmin}_{\theta \in O(W_h)} \sum_{i=1}^{t-1} \big(\langle \theta, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1})\big)^2$   (1)
         $\hat{\omega}_{t,h} \leftarrow \operatorname{argmin}_{\omega \in O(1)} \sum_{i=1}^{t-1} \big(\langle \omega, \phi(s_{i,h}, a_{i,h})\rangle - r_{i,h}\big)^2$   (2)
10:    // Perturb the estimated parameters by adding Gaussian noise //
11:    Update the parameters by sampling:
         $\theta_{t,h} \sim \hat{\theta}_{t,h} + N\big(0, \sigma_h^2 (I - P_{t,h})\Lambda_{t,h}^{-1}(I - P_{t,h})\big)$, $\omega_{t,h} \sim \hat{\omega}_{t,h} + N\big(0, \sigma_R^2 \Sigma_{t,h}^{-1}\big)$
12:    Define $Q_{t,h}$ …
The algorithm begins by initializing the covariance matrix $\Sigma_{1,h}$ for all $h \in [H]$ using the optimal design, which differs from most standard LSVI-type algorithms where it is initialized to the identity matrix. We believe that the identity matrix is unsuitable here since we do not assume any $\ell_2$-norm bound on the parameters. Additionally, since we assume the feature space spans $\mathbb{R}^d$, this initialization ensures that $\Sigma_{t,h}$ is invertible for all t and h. Otherwise, pseudo-inverses can be used instead.
At each round t ∈ [T ], the algorithm operates in a backward manner starting from the last horizon H. For each h ∈ [H], it first constructs the orthogonal projection matrix P t,h onto the span of the historical data. It then decomposes the D-optimal design points into the span and null space components using the projection and constructs Λ t,h . By separating the span and null space components, it facilitates clearer concentration bounds for the subsequent Gaussian noise.
The algorithm then performs constrained squared loss regression to estimate the value function and reward function. Here we define $O(W) := \{\theta \in \mathbb{R}^d : |\langle \theta, \phi(s, a)\rangle| \le W \text{ for all } s \in S, a \in A\}$ for any $W > 0$. This convex constraint set is defined by an $\ell_\infty$-functional-norm bound instead of an $\ell_2$-norm bound because we do not assume any bound on the $\ell_2$-norm of the learned parameters. We set $W_h = \Theta\big((d\sqrt{mH})^{H-h}(d^{3/2} + d\sqrt{mH})\big)$ (detailed definition deferred to Appendix C). We note that although $W_h$ is exponential in the horizon, which may seem suspicious, this does not affect our sample efficiency due to the span argument that we introduce in the analysis. We note that prior RLSVI algorithms used truncation on value functions to explicitly avoid such an exponential blow-up. However, truncation does not work for the linear Bellman completeness setting since the Bellman backup of a truncated value function is not necessarily a linear function anymore.
Next, the algorithm perturbs the estimated parameters by adding Gaussian noise. The noise for the value function acts only in the null space of the data covariance matrix. This ensures optimism while keeping the estimate accurate in the span of the data. It is a key modification from the standard RLSVI algorithm. The perturbation for the reward function is standard. Finally, the algorithm constructs the value function for the current horizon and the greedy policy with respect to it. It then generates the trajectory by executing the greedy policy, and the covariance matrix is updated.
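The null-space perturbation of the value parameters can be sketched as follows. This is an illustrative implementation of the sampling step for θ only, under the assumption that the historical features and the design points are available as NumPy arrays; the function name `null_space_noise` and the toy inputs are hypothetical.

```python
import numpy as np

def null_space_noise(data_feats, design_feats, design_weights, sigma, rng):
    """Sample from N(0, sigma^2 (I-P) Lam^{-1} (I-P)), with P the projection onto span(data)."""
    d = design_feats.shape[1]
    # Orthogonal projection onto the span of the historical features (rows of data_feats).
    P = np.linalg.pinv(data_feats) @ data_feats if len(data_feats) else np.zeros((d, d))
    par = design_feats @ P                    # rows: P phi_i (P is symmetric)
    perp = design_feats - par                 # rows: (I - P) phi_i
    w = design_weights[:, None]
    Lam = par.T @ (w * par) + perp.T @ (w * perp)
    # If Lam = L L^T, then z = sigma * L^{-T} xi has covariance sigma^2 Lam^{-1}.
    L = np.linalg.cholesky(Lam)
    z = sigma * np.linalg.solve(L.T, rng.standard_normal(d))
    return (np.eye(d) - P) @ z                # keep only the null-space component

rng = np.random.default_rng(0)
data = np.array([[1.0, 0, 0, 0], [1.0, 1, 0, 0], [2.0, 1, 0, 0]])   # rank-2 history in R^4
noise = null_space_noise(data, np.eye(4), np.full(4, 0.25), 2.0, rng)
print(data @ noise)                           # numerically zero on every historical feature
```

Because the noise is orthogonal to every historical feature, the perturbed parameter predicts the same values as the regression estimate on the data seen so far, which is the property the analysis relies on under deterministic dynamics.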

Section: ANALYSIS
In this section, we provide the theoretical guarantees of Algorithm 1. A high-level proof sketch can be found in Appendix B and detailed proofs are in Appendix C. We first consider the case where the squared loss minimization oracle is exact. We then extend the analysis to the approximate oracle and the low inherent linear Bellman error setting in subsequent sections.

Section: PRELUDE: LEARNING WITH EXACT SQUARE LOSS MINIMIZATION ORACLE
We first consider the most ideal setting where the squared loss minimization oracle is exact. Assumption 2 (Exact Squared Loss Minimization Oracle). Line 9 of Algorithm 1 is solved exactly.
Then, we have the following regret bound. A proof sketch is provided in Appendix B for the reader's convenience.
Theorem 1 (Regret Bound with Exact Oracle). Under Assumptions 1 and 2, executing Algorithm 1 with parameters $\sigma_R = \Theta(\sqrt{dH \log(HT)})$ and $\sigma_h = \Theta\big((d\sqrt{mH})^{H-h+1}(\sqrt{d} + \sqrt{mH})\big)$, we have
$$\mathrm{Reg}_T = \tilde{O}\big(d^{5/2} H^{5/2} + d^{2} H^{3/2} \sqrt{T}\big).$$
This result has several notable features. First, it does not depend on the number of actions. The only requirement for the action space is the ability to compute the argmax. Second, the $\sqrt{T}$-dependence on T is optimal, as it is necessary even in the bandit setting. Additionally, we emphasize that the dependence on $\sqrt{T}$ arises solely from reward learning due to the application of the elliptical potential lemma. In fact, if the reward function is known, our regret bound can be as small as $\tilde{O}(dH^2)$, depending on T only through logarithmic factors. We elaborate on this observation in Appendix B. As a standard practice, Theorem 1 can be converted into a sample complexity bound below.
Corollary 1 (Sample Complexity Bound). Let $\varepsilon \le 1$. Under the same setting as Theorem 1, letting $T \ge \Omega(d^4 H^3 / \varepsilon^2)$, we get that the policy $\pi$ chosen uniformly from the set $\{\pi_1, \dots, \pi_T\}$ enjoys the performance guarantee
$$\mathbb{E}[V^{\star} - V^{\pi}] \le \varepsilon.$$
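The conversion from the regret bound of Theorem 1 to Corollary 1 can be written out as the following short calculation. It is a sketch that ignores constants and logarithmic factors and assumes, as in the preliminaries, that the initial states are drawn i.i.d. from µ.

```latex
\begin{align*}
\mathbb{E}\big[V^\star - V^{\pi}\big]
  &= \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\big]
   = \frac{\mathrm{Reg}_T}{T} \\
  &= \tilde{O}\!\left(\frac{d^{5/2}H^{5/2}}{T} + \frac{d^{2}H^{3/2}}{\sqrt{T}}\right).
\end{align*}
% With T >= d^4 H^3 / eps^2 and eps <= 1:
%   d^2 H^{3/2} / sqrt(T) <= eps,
%   d^{5/2} H^{5/2} / T   <= eps^2 / (d^{3/2} H^{1/2}) <= eps.
```

Both terms are therefore at most ε (up to the hidden factors) once T reaches the stated threshold, with the second term being the binding one.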
Published as a conference paper at ICLR 2025

Section: LEARNING WITH APPROXIMATE SQUARE LOSS MINIMIZATION ORACLE
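Assumption 3 (Approximate Squared Loss Minimization Oracle). We assume access to an approximate squared loss minimization oracle $O^{\mathrm{apx}}_{sq}$ that takes as input a problem of the form
$$\operatorname{argmin}_{\theta \in O(W)} g(\theta) := \sum_{(\phi(s,a),\, u) \in D} \big(\langle \theta, \phi(s, a)\rangle - u\big)^2,$$
where $O(W) = \{\theta \in \mathbb{R}^d : |\langle \theta, \phi(s, a)\rangle| \le W\}$ for some $W \in \mathbb{R}$ is a convex set, and $D$ is a dataset of tuples $\{(\phi(s, a), u)\}$. The oracle returns a point $\hat{\theta}$ that satisfies $g(\hat{\theta}) - \min_{\theta \in O(W)} g(\theta) \le \varepsilon_1^2$ and $\hat{\theta} \in O(W + \varepsilon_2)$, where $\varepsilon_1, \varepsilon_2 \le 1$ are precision parameters of the oracle.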
With an approximate oracle, the regret bound depends on an additional quantity defined below.
Assumption 4. There exists a constant $\gamma \ge 1$ such that, for any $r \le d$ and any $\phi_1, \phi_2, \dots, \phi_r \in \Phi$, the eigenvalues of the matrix $\Sigma := \sum_{i=1}^{r} \phi_i \phi_i^{\top}$ are either zero or at least $1/\gamma^2$.
As a concrete example, it holds with $\gamma = 1$ when the MDP is tabular. This assumption implies that the eigenvalues of $\Sigma^{\dagger}$ are at most $\gamma^2$. Consequently, for any vector $\phi \in \Phi$, we have $\|\phi\|_{\Sigma^{\dagger}} \le \gamma \|\phi\|_2 \le \gamma$. This upper bound on the norm of any vector is exactly what we need for the analysis of an approximate oracle, while Assumption 4 simply serves as a sufficient condition for it. The following theorem provides the regret bound with the approximate oracle in terms of the parameters $\varepsilon_1$, $\varepsilon_2$ and $\gamma$.
Theorem 2 (Regret Bound with Approximate Oracle). Under Assumptions 1, 3 and 4, executing Algorithm 1 with $\sigma_R = \Theta(\sqrt{dH})$ and $\sigma_h = \Theta\big((d\sqrt{mH})^{H-h+1}(\varepsilon_1\gamma\sqrt{H} + \sqrt{d} + \sqrt{mH})\big)$, we have
$$\mathrm{Reg}_T = \tilde{O}\big(d^{5/2}H^{5/2} + d^2H^{3/2}\sqrt{T} + \varepsilon_1\gamma(dH^2 + d^{3/2}H\sqrt{T})\big).$$
Compared to Theorem 1, the regret bound has an additional term that depends on the approximation error $\varepsilon_1\gamma$. Typically, $\varepsilon_1$ comes from optimization and thus can be exponentially small with respect to the relevant parameters, as we discuss later in Section 6. Hence, we can allow $\gamma$ to be exponentially large. Moreover, $\varepsilon_2$ does not appear in the regret bound since it only affects the constraint violation of the regression, whose effect on the statistical guarantees is of lower order and thus ignored. In addition, the regret bound does not depend on the number of actions, and the dependence on $T$ remains optimal, as in the previous theorem.

Section: LEARNING WITH LOW INHERENT LINEAR BELLMAN ERROR
Now we consider the setting where the MDP has low inherent linear Bellman error.
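Definition 3 (Inherent Linear Bellman Error). Given $\varepsilon_B \le 1$, an MDP $M$ is said to have $\varepsilon_B$-inherent linear Bellman error with respect to a feature mapping $\phi$ if there exists a mapping $\mathcal{T} : \mathbb{R}^d \to \mathbb{R}^d$ so that, for all $\theta \in \mathbb{R}^d$ and all $(s,a) \in \mathcal{S} \times \mathcal{A}$, it holds that
$$\Big|\langle \mathcal{T}\theta, \phi(s,a)\rangle - \mathbb{E}_{s' \sim T(s,a)} \max_{a'} \langle \theta, \phi(s',a')\rangle\Big| \le \varepsilon_B.$$
Moreover, we require that, for all $h \in [H]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$, the random reward is bounded in $[0,1]$ with $|r_h(s,a) - \langle \omega^\star_h, \phi(s,a)\rangle| \le \varepsilon_B$ for some unknown $\omega^\star_h \in \mathbb{R}^d$.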
With low inherent Bellman error, Assumption 4 is still necessary. The following theorem provides the regret bound in this case. We assume the exact oracle for simplicity.
Theorem 3 (Regret Bound with Low Inherent Bellman Error). Assume the MDP has $\varepsilon_B$-inherent Bellman error. Under Assumptions 1, 2 and 4, when executing Algorithm 1 with parameters $\sigma_R = \Theta(\sqrt{dH + \varepsilon_B HT})$ and $\sigma_h = \Theta\big((d\sqrt{mH})^{H-h+1}(\varepsilon_B\gamma\sqrt{HT} + \sqrt{\varepsilon_B T} + \sqrt{d} + \sqrt{mH})\big)$, we have
$$\mathrm{Reg}_T = \tilde{O}\big(d^{5/2}H^{5/2} + d^2H^{3/2}\sqrt{T} + \sqrt{\varepsilon_B}(d^2H^{5/2}\sqrt{T} + d^{3/2}H^{3/2}T) + \varepsilon_B\gamma(dH^2\sqrt{T} + d^{3/2}HT)\big).$$
Compared to Theorem 1, the regret bound includes two additional terms that depend on the inherent linear Bellman error ε B . For both terms, the dependence on T is linear. We believe the T -dependence is unavoidable, as it also appears in similar settings (Zanette et al., 2020b). In addition, it is worth noting that the regret bound does not depend on the number of actions, and the other dependence on T remains optimal, similar to previous theorems.

Section: OPENING THE BLACK-BOX: IMPLEMENTING SQUARED LOSS MINIMIZATION ORACLES IN ALGORITHM 1
In this section, we detail a practical implementation of the squared loss oracle needed by our algorithm. The implementation relies on the observation that a squared loss minimization objective over a convex domain can be cast as a convex set feasibility problem: given a convex set $K$, return a point $\theta \in K$. Thus, we can use algorithms for convex set feasibility to implement the squared loss minimization oracles. However, even given this observation, the key challenge for an efficient algorithm is that the corresponding convex set could be exponentially large and may only be describable using exponentially many linear constraints. Fortunately, various works in the optimization literature propose computationally efficient procedures to find feasible points within such sets, under mild oracle assumptions.

Section: COMPUTATIONALLY EFFICIENT CONVEX SET FEASIBILITY
We first paraphrase the work of Bertsimas & Vempala (2004), who provide a computationally efficient procedure for finding feasible points within a convex set by random walks. Notably, the computational complexity of their algorithm depends only logarithmically on the size of the convex set, and thus their approach is well suited for the convex feasibility problems that appear in our approach. At a high level, they provide an algorithm that takes as input an arbitrary convex set $K \subseteq \mathbb{R}^d$ and returns a feasible point $\hat{z} \in K$. Their algorithm accesses the convex set $K$ via a separation oracle defined as follows.
Definition 4 (Separation oracle). A separation oracle for a convex set K, denoted by O sep K , is defined such that on any input z ∈ R d , the oracle either confirms that z ∈ K or returns a hyperplane ⟨a, z⟩ ≤ b that separates the point z from the set K.
In order to ensure finite time convergence for their procedure, they assume that the convex set K is not degenerate and is bounded in any direction. This is formalized by the following assumption.
Assumption 5. The convex set K is (r, R)-Bounded, i.e. there exist parameters 0 < r ≤ R such that (a) K ⊆ R ∞ (R), and (b) there exists a vector z ∈ R d such that the shifted cube (z + R ∞ (r)) ⊆ K.
The computational efficiency and the convergence guarantee of their algorithm are below.
Theorem 4 (Bertsimas & Vempala (2004)). Let $\delta \in (0,1)$ and let $K \subset \mathbb{R}^d$ be an arbitrary convex set that satisfies Assumption 5 for some $0 < r \le R$. Then, Algorithm 2 (given in the appendix), when invoked with the separation oracle $\mathcal{O}^{\mathrm{sep}}_K$ w.r.t. $K$, returns a feasible point $\hat{z} \in K$ with probability at least $1 - \delta$. Moreover, Algorithm 2 makes $O(d\log(R/(\delta r)))$ calls to the oracle $\mathcal{O}^{\mathrm{sep}}_K$ and runs in time $O(d^7\log(R/(\delta r)))$.
Notice that both the number of oracle calls and the running time only depend logarithmically on R and r, and thus their procedure can be efficiently implemented for our applications where R /r may be exponentially large in the corresponding problem parameters.

Section: COMPUTATIONALLY EFFICIENT ESTIMATION OF VALUE FUNCTION (EQN (1))
We now describe how to leverage the method of Bertsimas & Vempala (2004) to estimate the parameters of the value functions in (1) in Algorithm 1. Note that for any time $t$ and step $h \in [H]$, the objective in (1) is the optimization problem
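$$\hat\theta_{t,h} \leftarrow \operatorname*{argmin}_{\theta \in \mathcal{O}(W_h)} \sum_{i=1}^{t-1} \left(\langle \theta, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1})\right)^2, \tag{3}$$
where $W_h = \Theta\big((d\sqrt{mH})^{H-h}(\varepsilon_1 d\gamma\sqrt{H} + d^{3/2} + d\sqrt{mH})\big)$.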
We provide a computationally efficient procedure to approximately solve the above given a linear optimization oracle over the feature space.
Assumption 6 (Linear optimization oracle over the feature space). Learner has access to a linear optimization oracle O lin that on taking input θ ∈ R d , returns a feature ϕ(s ′ , a ′ ) ∈ argmax s,a ⟨θ, ϕ(s, a)⟩.
The key observation we use is that under linear Bellman completeness (Definition 1) and deterministic dynamics (Assumption 1), any solution θ for (3) must satisfy ∑ t-1 i=1 (⟨θ, ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 = 0. On the other hand, the converse also holds that any point θ ∈ O(W h ) for which the objective value is 0 must be a solution to (3). Thus, the minimization problem in (3) is equivalent to finding a feasible point within the convex set
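$$K := \left\{\theta \in \mathbb{R}^d \;\middle|\; \begin{array}{ll} \left(\langle \theta, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1})\right)^2 = 0 & \text{for all } i \le t \\ |\langle \theta, \phi(s,a)\rangle| \le W_h & \text{for all } s, a \end{array}\right\}. \tag{4}$$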
Given the above reformulation of the optimization objective (3) as a feasibility problem, we can now use the procedure of Bertsimas & Vempala (2004) for finding θ t,h ∈ K. However, we first need to define a separation oracle for the set K and verify Assumption 5. Unfortunately, there may not exist any r > 0 for which (z + R ∞ (r)) ⊆ K for some z ∈ R d in our case and thus the above K may not satisfy Assumption 5. This can, however, be easily fixed by artificially increasing the set K to allow for some approximation errors. In particular, let ε > 0 and define the convex set
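$$K_{\mathrm{APX}} := \left\{\theta \in \mathbb{R}^d \;\middle|\; \begin{array}{ll} \langle \theta, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1}) \le \varepsilon & \text{for all } i \le t \\ \langle \theta, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1}) \ge -\varepsilon & \text{for all } i \le t \\ |\langle \theta, \phi(s,a)\rangle| \le W_h + \varepsilon & \text{for all } s, a \end{array}\right\}. \tag{5}$$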
Clearly, since there exists at least one point θ t,h ∈ K, we must have that
(θ t,h + R ∞ (ε)) ⊆ K APX .
To ensure an outer bounding box for the set K APX , we need to make an additional assumption.
Assumption 7. Let $\Phi = \{\phi(s,a) \mid (s,a) \in \mathcal{S} \times \mathcal{A}\}$. There exists some $R > 0$ such that $\frac{1}{R}e_i \in \Phi$ for each $i \in [d]$, where $e_i$ denotes the unit basis vector along the $i$-th direction in $\mathbb{R}^d$.
The above assumption ensures that $K \subseteq \mathcal{R}_\infty(W_h R)$. Recall that we can tolerate the parameter $R$ being exponential in the dimension $d$ or the horizon $H$. Finally, a separation oracle can be implemented using $\mathcal{O}^{\mathrm{lin}}$ (see Algorithm 4 for details). Thus, one can use Algorithm 2 (given in the appendix), due to Bertsimas & Vempala (2004), and the guarantee in Theorem 4 to find a feasible point in $K_{\mathrm{APX}}$, which corresponds to an approximate solution to (3).
Theorem 5. Let $\varepsilon > 0$, $\delta \in (0,1)$, and suppose Assumption 7 holds with some parameter $R > 0$. Additionally, suppose Assumption 6 holds with the linear optimization oracle denoted by $\mathcal{O}^{\mathrm{lin}}$. Then, there exists a computationally efficient procedure (given in Algorithm 4 in the appendix) that, for any $t \in [T]$ and $h \in [H]$, returns a point $\hat\theta_{t,h}$ that, with probability at least $1 - \delta$, satisfies
$$\sum_{i=1}^{t-1} \left(\langle \hat\theta_{t,h}, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1})\right)^2 \le T\varepsilon \quad\text{and}\quad \hat\theta_{t,h} \in \mathcal{O}(W_h + \varepsilon).$$
Furthermore, Algorithm 4 takes $O\big(d^7\log\frac{R}{\delta\varepsilon}\big)$ time in addition to $O\big(d\log\frac{THR}{\delta\varepsilon}\big)$ calls to $\mathcal{O}^{\mathrm{lin}}$.
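To illustrate how a separation oracle for $K_{\mathrm{APX}}$ can be built from linear optimization over the features, here is a minimal sketch. It is not the paper's Algorithm 4: it assumes a finite feature matrix and replaces $\mathcal{O}^{\mathrm{lin}}$ with a brute-force argmax, and all function and argument names are hypothetical. The norm constraint over all $(s,a)$ is checked with two oracle calls, one for $\theta$ and one for $-\theta$.

```python
import numpy as np

def lin_opt_oracle(theta: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Stand-in for the linear optimization oracle: brute-force argmax (assumption)."""
    return features[np.argmax(features @ theta)]

def separation_oracle(theta, data_phi, targets, features, W, eps):
    """Return None if theta is in K_APX; otherwise return (a, b) such that
    <a, z> <= b for every z in K_APX while <a, theta> > b.
    Assumes at least one data point (data_phi has at least one row)."""
    resid = data_phi @ theta - targets
    i = int(np.argmax(np.abs(resid)))
    if abs(resid[i]) > eps:
        # Violated regression constraint: cut along +/- phi_i.
        sign = float(np.sign(resid[i]))
        return sign * data_phi[i], sign * targets[i] + eps
    for sign in (1.0, -1.0):
        # Most violated norm constraint in direction sign * theta.
        phi = lin_opt_oracle(sign * theta, features)
        if sign * float(phi @ theta) > W + eps:
            return sign * phi, W + eps
    return None

features = np.eye(2)
data_phi = np.array([[1.0, 0.0]])
targets = np.array([0.5])
inside = separation_oracle(np.array([0.5, 0.0]), data_phi, targets, features, 1.0, 0.1)
cut_fit = separation_oracle(np.array([0.9, 0.0]), data_phi, targets, features, 1.0, 0.1)
cut_norm = separation_oracle(np.array([0.5, -3.0]), data_phi, targets, features, 1.0, 0.1)
```

Each returned pair is one of the linear constraints defining $K_{\mathrm{APX}}$, so the exponentially many norm constraints never have to be enumerated explicitly.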
The above techniques and Algorithm 4 can be similarly extended to get a computationally efficient procedure to estimate the reward parameter in (2). The main difference is that the value of the optimization objective in (2) is not zero at the minimizer (due to stochasticity). Thus, we need to construct a set feasibility problem for every desired target value of the objective function within the grid [0, ε, 2ε, . . . , 2ε, 2] and use a separating hyperplane w.r.t. the ellipsoid constraint in (2) to implement the separating hyperplane for K APX (which can be implemented using projections).

Section: CONCLUSION
In this paper, we develop a computationally efficient RL algorithm under linear Bellman completeness with deterministic dynamics, aiming to bridge the statistical-computational gap in this setting. 
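Notation:
$\mathcal{O}(W)$: $\{\theta \in \mathbb{R}^d : |\langle \theta, \phi(s,a)\rangle| \le W \text{ for all } s \in \mathcal{S}, a \in \mathcal{A}\}$
$\mathcal{R}_\infty(W)$: $\{\theta \in \mathbb{R}^d : \|\theta\|_\infty \le W\}$
$\mathcal{R}_2(W)$: $\{\theta \in \mathbb{R}^d : \|\theta\|_2 \le W\}$
$\eta_{t,h}$: $\mathcal{T}(\omega_{t,h+1} + \theta_{t,h+1}) - \hat\theta_{t,h}$
$\eta^R_{t,h}$: $\omega^\star_h - \hat\omega_{t,h}$
$\xi^R_{t,h}$: $\omega_{t,h} - \hat\omega_{t,h}$
$\xi^P_{t,h}$: $\theta_{t,h} - \hat\theta_{t,h}$
$\mathcal{E}^{\mathrm{high}}$: High probability event, defined in Definition 5
$\mathcal{E}^{\mathrm{span}}_t$: Event that the trajectory at round $t$ is within the span of historical data, defined in (6)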
E
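$\Sigma_{t,h}$: $\sum_{i=1}^m \rho_i\phi_i\phi_i^\top + \sum_{i=1}^{t-1} \phi(s_{i,h}, a_{i,h})\phi^\top(s_{i,h}, a_{i,h})$
$\hat\Sigma_{t,h}$: $\sum_{i=1}^{t-1} \phi(s_{i,h}, a_{i,h})\phi^\top(s_{i,h}, a_{i,h})$
$W_h$: Recursively defined as $W_{h-1} = W_h + 2\varepsilon_2 + \sqrt{2d} \cdot B^P_{\mathrm{noise},h} + \sqrt{2d} \cdot B^R_{\mathrm{noise}} + 1$ with $W_{H+1} = 1$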

Section: B PROOF OVERVIEW
In this section, we provide a sketch of the proof of Theorem 1 (exact oracle and zero inherent linear Bellman error) with the full proofs deferred to Appendix C. To better convey the intuition, we now assume that the reward function is known, as reward learning is largely standard. In particular, we temporarily remove the estimation and perturbation of rewards (Lines 9 and 11) and simply assume ω t,h = ω ⋆ h in Line 12.

Section: B.1 SPAN ARGUMENT
The very first step of our analysis revolves around two complementary cases: whether or not the trajectory at round $t$ is in the span of the historical data. Let $D_{t,h} := \{\phi(s_{i,h}, a_{i,h})\}_{i=1}^{t}$ and define $\mathcal{E}^{\mathrm{span}}_t$ as the event that the trajectory at round $t$ is in the span of the historical data, i.e.,
E span t ∶= {∀h ∈ [H] ∶ ϕ(s t,h , a t,h ) ∈ span(D t-1,h )} .(6)
(1) In-span case. When the trajectory generated in round $t$ is completely within the span of historical data, we can assert that the value function estimation is accurate under $\pi_t$. In particular, by linear Bellman completeness, the Bayes-optimal solution of the regression in Line 9 attains zero empirical risk, as formally stated in the following lemma.
Lemma 1. For any $t \in [T]$, we have
$$\sum_{i=1}^{t-1} \left(\langle \hat\theta_{t,h}, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1})\right)^2 = 0.$$
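The phenomenon behind Lemma 1 can be reproduced in a toy regression. The sketch below assumes noiseless linear targets (the role played by deterministic dynamics plus Bellman completeness) and ignores the constraint set $\mathcal{O}(W_h)$; the variable names are hypothetical. Even with fewer samples than dimensions, the empirical risk is zero, while the estimate may still differ from the true parameter in directions orthogonal to the data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3                        # fewer samples than dimensions
Phi = rng.normal(size=(n, d))      # historical features, one per row
theta_star = rng.normal(size=d)    # plays the role of the Bellman backup parameter
targets = Phi @ theta_star         # deterministic transitions: no noise in targets

# Minimum-norm least squares solution.
theta_hat, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

empirical_risk = float(np.sum((Phi @ theta_hat - targets) ** 2))
parameter_gap = float(np.linalg.norm(theta_hat - theta_star))
in_span_error = float(np.linalg.norm(Phi @ (theta_hat - theta_star)))
```

The remaining error lives entirely in the null space of the data, which is the part the algorithm explores through noise.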
Define U t (⋅) as a version of V t (⋅) that minimizes V t (s t,1 ) while satisfying the high probability bound (precise definition provided at the beginning of Appendix C.2). It implies the following.
Lemma 2. For any t ∈ [T ], whenever E span t holds, we have V t (s t,1 ) = U t (s t,1 ) = V πt (s t,1 ).
To understand Lemma 2, we consider two facts: (1) $\pi_t$ is the optimal policy for the estimated value function $V_t$, and (2) both $V_t$ and $U_t$ have accurate value estimates along the trajectory induced by $\pi_t$ starting from $s_{t,1}$, because this trajectory is in the span of the historical data when $\mathcal{E}^{\mathrm{span}}_t$ holds.
(2) Out-of-span case. When any segment of the trajectory is not within the span, we simply pay $H$ in regret and can assert that this will not occur too many times. To see this, we observe the following fact: whenever $\mathcal{E}^{\mathrm{span}}_t$ does not hold, there must exist $h \in [H]$ such that $\dim \mathrm{span}(D_{t,h}) = \dim \mathrm{span}(D_{t-1,h}) + 1$ by definition. Since the dimension of the span cannot exceed $d$ for any $h \in [H]$, the event that $\mathcal{E}^{\mathrm{span}}_t$ fails can happen at most $dH$ times. We formally state this in the following lemma.
Lemma 3. We have
$$\sum_{t=1}^{T} \mathbb{1}\{(\mathcal{E}^{\mathrm{span}}_t)^\complement\} \le dH.$$
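The counting argument can be simulated directly. The sketch below assumes each round supplies one feature vector per step $h$ and counts the rounds in which some feature leaves the span of that step's history; the function name and the random data are hypothetical.

```python
import numpy as np

def count_out_of_span(rounds, d, tol=1e-9):
    """rounds: list of (H, d) arrays, row h is the feature visited at step h.
    Returns the number of rounds where some step's feature is outside the
    span of the features seen at that step in earlier rounds."""
    H = rounds[0].shape[0]
    # bases[h] keeps only linearly independent rows, so its row count is its rank.
    bases = [np.zeros((0, d)) for _ in range(H)]
    count = 0
    for traj in rounds:
        out = False
        for h in range(H):
            stacked = np.vstack([bases[h], traj[h]])
            if np.linalg.matrix_rank(stacked, tol=tol) > bases[h].shape[0]:
                out = True
                bases[h] = stacked
        count += int(out)
    return count

rng = np.random.default_rng(0)
H, d = 3, 4
full_rank_rounds = [rng.normal(size=(H, d)) for _ in range(50)]
B = rng.normal(size=(2, d))  # features confined to a 2-dimensional subspace
low_rank_rounds = [rng.normal(size=(H, 2)) @ B for _ in range(50)]

n_full = count_out_of_span(full_rank_rounds, d)
n_low = count_out_of_span(low_rank_rounds, d)
```

Each out-of-span round raises the rank at some step by one, and each of the $H$ ranks is capped at $d$, so the count never exceeds $dH$.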
Hence, we have the following decomposition:
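$$V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1}) = \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}\big(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\big) + \underbrace{\mathbb{1}\{(\mathcal{E}^{\mathrm{span}}_t)^\complement\}\big(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\big)}_{\le\, dH^2 \text{ when summed over } t}.$$
Therefore, we only need to focus on the rounds where $\mathcal{E}^{\mathrm{span}}_t$ holds. This will be the aim of the subsequent sections.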

Section: B.2 EXPLORATION IN THE NULL SPACE
Lemma 1 implies that the estimation error only comes from the null space of the historical data, i.e., null({ϕ(s i,h , a i,h ) ∶ i = 1, . . . , t -1}). Therefore, we only need to explore in this null space. While adding explicit bonus is infeasible under linear Bellman completeness, we add noise (Line 11) that can cancel out the estimation error in the null space. This achieves the following:
Lemma 4 (Optimism with constant probability). Denote E optm t as the event that V ⋆ (s t,1 ) ≤ V t (s t,1 ). Then, for any t ∈ [T ], we have Pr(E optm t ) ≥ Γ 2 (-1) where Γ is the cumulative distribution function of the standard normal distribution.
This result has been the key idea in randomized RL algorithms, such as RLSVI. In the next section, we will explore how this lemma is utilized.
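The null-space noise can be sketched as follows: sample a Gaussian vector and remove its component in the span of the historical features, so that predictions on past data are untouched. This is a minimal illustration assuming a generic covariance argument; the function name is hypothetical, and the identity covariance in the usage lines is a placeholder rather than the covariance used by Algorithm 1.

```python
import numpy as np

def null_space_noise(history: np.ndarray, cov: np.ndarray, rng) -> np.ndarray:
    """Sample zeta ~ N(0, cov) and return (I - P) zeta, where P is the
    orthogonal projector onto the span of the rows of `history`."""
    d = history.shape[1]
    P = np.linalg.pinv(history) @ history   # projector onto the row space
    zeta = rng.multivariate_normal(np.zeros(d), cov)
    return (np.eye(d) - P) @ zeta

rng = np.random.default_rng(0)
history = rng.normal(size=(3, 6))           # three historical features in R^6
xi = null_space_noise(history, np.eye(6), rng)
effect_on_history = history @ xi            # numerically zero
```

Because the perturbation is orthogonal to every historical feature, the zero empirical risk from Lemma 1 is preserved after the noise is added.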

Section: B.3 PROOF OUTLINE
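In this section, we outline the structure of the whole proof. Let $\tilde V$ denote an i.i.d. copy of $V$, and let $\tilde{\mathcal{E}}^{\mathrm{span}}_t$, $\tilde{\mathcal{E}}^{\mathrm{optm}}_t$ denote the counterparts of $\mathcal{E}^{\mathrm{span}}_t$, $\mathcal{E}^{\mathrm{optm}}_t$ for $\tilde V$. We first invoke Lemma 2 and get
$$\mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}\big(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\big) = \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}\big(V^\star(s_{t,1}) - U_t(s_{t,1})\big) \le V^\star(s_{t,1}) - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1}),$$
where the last step is by the non-negativity of $V^\star$. Next, we apply Lemma 4 and get
$$\le \mathbb{E}_{\tilde V_t}\Big[\min\{\tilde V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1}) \,\Big|\, \tilde{\mathcal{E}}^{\mathrm{optm}}_t\Big].$$
We split it into two parts:
$$= \mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{\tilde{\mathcal{E}}^{\mathrm{span}}_t\}\big(\min\{\tilde V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\big) \,\Big|\, \tilde{\mathcal{E}}^{\mathrm{optm}}_t\Big] + \mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{(\tilde{\mathcal{E}}^{\mathrm{span}}_t)^\complement\}\big(\min\{\tilde V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\big) \,\Big|\, \tilde{\mathcal{E}}^{\mathrm{optm}}_t\Big].$$
Note that the quantity inside the first expectation is non-negative, so we can peel off the conditioning event; the quantity in the second term is simply upper bounded by $H$. Hence, we have
$$\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{\tilde{\mathcal{E}}^{\mathrm{span}}_t\}\big(\min\{\tilde V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\big)\Big] + \frac{1}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{(\tilde{\mathcal{E}}^{\mathrm{span}}_t)^\complement\}H\Big].$$
Now we split the first term into two parts again:
$$\begin{aligned} &= \frac{1}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{\tilde{\mathcal{E}}^{\mathrm{span}}_t\}\min\{\tilde V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\Big] + \frac{1}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{(\tilde{\mathcal{E}}^{\mathrm{span}}_t)^\complement \cap \mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\Big] + \frac{1}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{(\tilde{\mathcal{E}}^{\mathrm{span}}_t)^\complement\}H\Big] \\ &\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{\tilde{\mathcal{E}}^{\mathrm{span}}_t\}\min\{\tilde V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\Big] + \frac{2}{\Gamma^2(-1)}\,\mathbb{E}_{\tilde V_t}\Big[\mathbb{1}\{(\tilde{\mathcal{E}}^{\mathrm{span}}_t)^\complement\}H\Big], \end{aligned}$$
where we used the fact that $\mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1}) \le H$. Taking the expectation over the randomness of the algorithm and using the tower property, which converts $\tilde V$ into $V$, we obtain
$$\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\Big[\mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}\min\{V_t(s_{t,1}), H\} - \mathbb{1}\{\mathcal{E}^{\mathrm{span}}_t\}U_t(s_{t,1})\Big] + \frac{2}{\Gamma^2(-1)}\,\mathbb{E}\Big[\mathbb{1}\{(\mathcal{E}^{\mathrm{span}}_t)^\complement\}H\Big].$$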
The first term is upper bounded by zero due to Lemma 2, and the second term is upper bounded by dH 2 by Lemma 3 when summed over t. This finishes the proof.
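For a sense of scale, the constant $1/\Gamma^2(-1)$ appearing in the chain above can be evaluated numerically. The short sketch below computes the standard normal CDF through the error function; the variable names are hypothetical.

```python
import math

def std_normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

optimism_prob = std_normal_cdf(-1.0) ** 2   # lower bound on Pr(optimism) in Lemma 4
blow_up = 1.0 / optimism_prob               # multiplicative constant in the proof
```

The optimism probability is roughly 0.025, so the constant is a fixed number below 40 that does not depend on $d$, $H$, or $T$.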
Remark 1 (Span Argument and Exponential Blow-Up). In the proof sketch above, we did not utilize any $\ell_2$-norm bound on $\theta_{t,h}$ or $\hat\theta_{t,h}$, as is done in many prior works. We actually cannot leverage such bounds since the norms can be exponentially large due to the addition of exponentially large noise. This phenomenon is widely observed in the literature (e.g., Agrawal et al. (2021); Zanette et al. (2020a)) and is usually addressed through truncation. However, truncation does not work under linear Bellman completeness, as the Bellman backup of a truncated value function is not necessarily linear. This is why we use the span argument to circumvent this issue.

Section: C FULL PROOF FOR SECTION 5
In this section, we present and prove the following main theorem, which provides the regret bound in terms of the parameters $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_B$. Setting $\varepsilon_1 = \varepsilon_2 = \varepsilon_B = 0$ yields Theorem 1, setting $\varepsilon_B = 0$ yields Theorem 2, and setting $\varepsilon_1 = \varepsilon_2 = 0$ yields Theorem 3.
Theorem 6. Assume the MDP has $\varepsilon_B$-inherent linear Bellman error. Under Assumptions 1, 3 and 4, when executing Algorithm 1 with parameters $\sigma_R = \sqrt{H}B^R_{\mathrm{err}}$ and $\sigma_h \ge \sqrt{H}\big(\sqrt{3}\gamma B^P_{\mathrm{err}} + \sqrt{8m}(W_h + \varepsilon_2)\big)$, we have
$$\mathbb{E}\Big[\sum_{t=1}^{T}\big(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\big)\Big] = \tilde{O}\big(d^{5/2}H^{5/2} + d^2H^{3/2}\sqrt{T} + \varepsilon_1\gamma(dH^2 + d^{3/2}H\sqrt{T}) + \sqrt{\varepsilon_B}(d^2H^{5/2}\sqrt{T} + d^{3/2}H^{3/2}T) + \varepsilon_B\gamma(dH^2\sqrt{T} + d^{3/2}HT)\big).$$
Exact value of parameters σ R and σ h in Theorem 6. We define W H+1 = 1 and recursively define
W h-1 = W h + 2ε 2 + √ 2d ⋅ B P noise,h + √ 2d ⋅ B R noise + 1.
Plugging the definition of all these symbols involved and ignoring lower order terms (i.e., logarithmic and constant terms), we get
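$$W_{h-1} \approx d\sqrt{mH} \cdot W_h + \varepsilon_1 \cdot d\gamma\sqrt{H} + \varepsilon_B \cdot d\gamma\sqrt{HT} + \sqrt{\varepsilon_B} \cdot d\sqrt{T} + d^{3/2}. \tag{7}$$
Solving this recursion, we get
$$\begin{aligned} W_h &\approx (d\sqrt{mH})^{H+1-h} + (d\sqrt{mH})^{H-h}\big(\varepsilon_1 \cdot d\gamma\sqrt{H} + \varepsilon_B \cdot d\gamma\sqrt{HT} + \sqrt{\varepsilon_B} \cdot d\sqrt{T} + d^{3/2}\big) \\ &\approx (d\sqrt{mH})^{H-h}\big(\varepsilon_1 \cdot d\gamma\sqrt{H} + \varepsilon_B \cdot d\gamma\sqrt{HT} + \sqrt{\varepsilon_B} \cdot d\sqrt{T} + d^{3/2} + d\sqrt{mH}\big). \end{aligned}$$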
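The step from the recursion (7) to the closed form is a standard geometric unrolling. As a worked sketch, write $a := d\sqrt{mH}$ for the multiplicative factor and $b$ for the sum of the additive terms in (7); both abbreviations are introduced here only for illustration.

```latex
\begin{aligned}
W_{h-1} &\approx a\,W_h + b, \qquad W_{H+1} = 1 \\
\Longrightarrow\quad
W_h &\approx a^{H+1-h} + b\sum_{j=0}^{H-h} a^{j}
     = a^{H+1-h} + b\,\frac{a^{H+1-h}-1}{a-1} \\
    &\le a^{H+1-h} + 2b\,a^{H-h}
     = a^{H-h}\,(a + 2b) \qquad \text{whenever } a \ge 2 .
\end{aligned}
```

Dropping the constant 2 recovers the displayed form $(d\sqrt{mH})^{H-h}(b + d\sqrt{mH})$.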
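We insert this into the value of $\sigma_h$ and get
$$\sigma_h \approx (d\sqrt{mH})^{H-h+1}\big(\varepsilon_1 \cdot \gamma\sqrt{H} + \varepsilon_B \cdot \gamma\sqrt{HT} + \sqrt{\varepsilon_B} \cdot \sqrt{T} + d^{1/2} + \sqrt{mH}\big).$$
We can also get the value of $\sigma_R$ as
$$\sigma_R \approx \sqrt{H}\big(\sqrt{d\log(HT)} + \varepsilon_1 + \sqrt{\varepsilon_B T}\big).$$
Define $\Lambda = \sum_{i=1}^{m} \rho_i\phi_i\phi_i^\top$. It is straightforward that both $\Lambda$ and $\Lambda_{t,h}$ (constructed in Line 7 of Algorithm 1) are invertible. We define $\lambda := \max_{s,a}\|\phi(s,a)\|_{\Lambda^{-1}}$ and $\lambda_{t,h} := \max_{s,a}\|\phi(s,a)\|_{\Lambda_{t,h}^{-1}}$.
Lemma 5. The matrices $\Lambda$ and $\Lambda_{t,h}$ are invertible. Furthermore, we also have that
• $\lambda \le \sqrt{d}$;
• $\lambda_{t,h} \le \sqrt{2d}$ for all $t \in [T]$ and all $h \in [H]$.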
Proof of Lemma 5. By the last item in Lemma 23, we have $\lambda \le \sqrt{d}$. In what follows, we will show that $\Lambda \preceq 2\Lambda_{t,h}$, which implies $\lambda_{t,h} \le \sqrt{2}\lambda \le \sqrt{2d}$.
For any x ∈ R d , we have
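$$\begin{aligned} x^\top\Lambda x = \sum_{i=1}^{m} \rho_i(x^\top\phi_i)^2 &= \sum_{i=1}^{m} \rho_i\big(x^\top\phi^{\parallel}_{t,h,i} + x^\top\phi^{\perp}_{t,h,i}\big)^2 \\ &\le 2\sum_{i=1}^{m} \rho_i\big(x^\top\phi^{\parallel}_{t,h,i}\big)^2 + 2\sum_{i=1}^{m} \rho_i\big(x^\top\phi^{\perp}_{t,h,i}\big)^2 && \text{(using } (a+b)^2 \le 2a^2 + 2b^2\text{)} \\ &= 2x^\top\Lambda_{t,h}x. \end{aligned}$$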
This implies that Λ ⪯ 2Λ t,h .
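The matrix inequality can be checked numerically. The sketch below assumes, as the displayed chain suggests, that $\Lambda_{t,h} = \sum_i \rho_i\big(\phi^{\parallel}_{t,h,i}(\phi^{\parallel}_{t,h,i})^\top + \phi^{\perp}_{t,h,i}(\phi^{\perp}_{t,h,i})^\top\big)$, where the two components are the projections of $\phi_i$ onto the span of the history and onto its orthogonal complement. The random support points, weights, and history are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 5, 8, 2
design = rng.normal(size=(m, d))       # support points phi_i, one per row
rho = rng.dirichlet(np.ones(m))        # design weights summing to one
history = rng.normal(size=(n, d))      # historical features

P = np.linalg.pinv(history) @ history  # orthogonal projector onto span(history)
par = design @ P                       # rows are P phi_i (P is symmetric)
perp = design - par                    # rows are (I - P) phi_i

Lam = (design * rho[:, None]).T @ design
Lam_th = (par * rho[:, None]).T @ par + (perp * rho[:, None]).T @ perp

# Lambda <= 2 * Lambda_{t,h} in the Loewner order iff all eigenvalues are >= 0.
gap_eigs = np.linalg.eigvalsh(2 * Lam_th - Lam)
```

The difference $2\Lambda_{t,h} - \Lambda$ equals $\sum_i \rho_i(\phi^{\parallel}_{t,h,i} - \phi^{\perp}_{t,h,i})(\phi^{\parallel}_{t,h,i} - \phi^{\perp}_{t,h,i})^\top$, which is positive semidefinite by construction.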

Section: C.1 HIGH-PROBABILITY EVENT AND BOUNDEDNESS
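Lemma 6 (Reward estimation). With probability at least $1 - \delta$, for any $t \in [T]$ and $h \in [H]$,
$$\|\hat\omega_{t,h} - \omega^\star_h\|_{\Sigma_t} \le \sqrt{1030(1+\varepsilon_2)^4 d\log\big(8(1+\varepsilon_2)e^2T^2H/\delta\big) + 4\varepsilon_1^2 + 16(1+\varepsilon_2)(1+\varepsilon_B T)}.$$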
Proof of Lemma 6. For ease of notation, we fix $t$ and $h$ in the proof and simply write the regression problem as
$$\hat\omega \leftarrow \operatorname*{argmin}_{\omega \in \mathcal{O}(1)} \sum_{i=1}^{n} (\omega^\top\phi_i - r_i)^2,$$
where we have dropped the subscripts $t$ and $h$. Here $\phi_i$ and $r_i$ are abbreviated notations for $\phi(s_{i,h}, a_{i,h})$ and $r_{i,h}$, respectively, and $n = t - 1$.
For a parameter $\omega$, let $z^\omega_i := (\omega^\top\phi_i - r_i)^2 - ((\omega^\star)^\top\phi_i - r_i)^2$. Then we have $|z^\omega_i| \le 4(1+\varepsilon_2)^2$, and
E i [z ω i ] = E i [(ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i )(ω ⊺ ϕ i + (ω ⋆ ) ⊺ ϕ i -2r i )] = E i [(ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i + 2((ω ⋆ ) ⊺ ϕ i -r i ))] ≥ (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 -4(1 + ε 2 )ε B ,
and moreover,
E i [(z ω i ) 2 ] = E i [(ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 (ω ⊺ ϕ i + (ω ⋆ ) ⊺ ϕ i -2r i ) 2 ] ≤ 16(1 + ε 2 ) 2 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2
We note that $z^\omega_i - \mathbb{E}_i z^\omega_i$ is a martingale difference sequence and $|z^\omega_i - \mathbb{E}_i z^\omega_i| \le 8(1+\varepsilon_2)^2$. Applying Freedman's inequality (Lemma 22) and a union bound over $\omega \in \mathcal{C}$, we have with probability at least $1 - \delta$, for all $\omega \in \mathcal{C}$,
n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 - n ∑ i=1 z ω i ≤ η n ∑ i=1 16(1 + ε 2 ) 2 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 + 8(1 + ε 2 ) 2 log(|C|/δ) η + 4(1 + ε 2 )ε B T.(9)
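Recall that $\hat\omega$ is the least squares solution. Denote by $\tilde\omega \in \mathcal{C}$ the point in the cover that is closest to $\hat\omega$, which means that $\sum_{i=1}^{n} |\hat\omega^\top\phi_i - \tilde\omega^\top\phi_i| \le n\alpha$.
We can derive the following relationships between $\hat\omega$ and $\tilde\omega$:
$$\sum_{i=1}^{n} (\hat\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 \le 2\sum_{i=1}^{n} (\hat\omega^\top\phi_i - \tilde\omega^\top\phi_i)^2 + 2\sum_{i=1}^{n} (\tilde\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 \le 2n^2\alpha^2 + 2\sum_{i=1}^{n} (\tilde\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2,$$
$$\sum_{i=1}^{n} z^{\tilde\omega}_i - \sum_{i=1}^{n} z^{\hat\omega}_i = \sum_{i=1}^{n} (\tilde\omega^\top\phi_i - \hat\omega^\top\phi_i)(\tilde\omega^\top\phi_i + \hat\omega^\top\phi_i - 2r_i) \le 4(1+\varepsilon_2)n\alpha.$$
Now plugging $\tilde\omega$ into (9) and rearranging terms, we get
$$\sum_{i=1}^{n} (\tilde\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 \le \frac{1}{1 - 16(1+\varepsilon_2)^2\eta}\sum_{i=1}^{n} z^{\tilde\omega}_i + \frac{8(1+\varepsilon_2)^2}{\eta(1 - 16(1+\varepsilon_2)^2\eta)} \cdot \log(|\mathcal{C}|/\delta) + \frac{4(1+\varepsilon_2)\varepsilon_B T}{1 - 16(1+\varepsilon_2)^2\eta}.$$
Setting $\eta = (32(1+\varepsilon_2)^2)^{-1}$, we get
$$\sum_{i=1}^{n} (\tilde\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 \le 2\sum_{i=1}^{n} z^{\tilde\omega}_i + 512(1+\varepsilon_2)^4\log(|\mathcal{C}|/\delta) + 8(1+\varepsilon_2)\varepsilon_B T.$$
Using the relationships between $\hat\omega$ and $\tilde\omega$ that we derived above, we have
$$\begin{aligned} \sum_{i=1}^{n} (\hat\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 &\le 2n^2\alpha^2 + 4\sum_{i=1}^{n} z^{\tilde\omega}_i + 1024(1+\varepsilon_2)^4\log(|\mathcal{C}|/\delta) + 16(1+\varepsilon_2)\varepsilon_B T \\ &\le 2n^2\alpha^2 + 4\sum_{i=1}^{n} z^{\hat\omega}_i + 1024(1+\varepsilon_2)^4\log(|\mathcal{C}|/\delta) + 16(1+\varepsilon_2)n\alpha + 16(1+\varepsilon_2)\varepsilon_B T. \end{aligned}$$
Since $\hat\omega$ is the (approximate) least squares solution, we have $\sum_i z^{\hat\omega}_i \le \varepsilon_1^2$. This implies that
$$\sum_{i=1}^{n} (\hat\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 \le 2n^2\alpha^2 + 4\varepsilon_1^2 + 1024(1+\varepsilon_2)^4\log(|\mathcal{C}|/\delta) + 16(1+\varepsilon_2)(n\alpha + \varepsilon_B T).$$
Now plugging in the covering number (8) and setting $\alpha = 1/n$, we obtain
$$\begin{aligned} \sum_{i=1}^{n} (\hat\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 &\le 2 + 4\varepsilon_1^2 + 1024(1+\varepsilon_2)^4 d\log\big(8(1+\varepsilon_2)e^2n/\delta\big) + 16(1+\varepsilon_2)(1+\varepsilon_B T) \\ &\le 1026(1+\varepsilon_2)^4 d\log\big(8(1+\varepsilon_2)e^2n/\delta\big) + 4\varepsilon_1^2 + 16(1+\varepsilon_2)(1+\varepsilon_B T). \end{aligned}$$
Finally, we have
$$\|\hat\omega - \omega^\star_h\|^2_{\Sigma_t} = \sum_{i=1}^{n} (\hat\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2 + \sum_{i=1}^{m} \rho_i(\hat\omega^\top\phi_i - (\omega^\star)^\top\phi_i)^2.$$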
Here, with some abuse of notation, the ϕ i 's in the right term are the support points of the optimal design. The first term is already bounded above. The second term can be bounded by
m ∑ i=1 ρ i (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ m ∑ i=1 ρ i ⋅ 4(1 + ε 2 ) = 4(1 + ε 2 ).
We add it into the constant of the first term. Then, we apply the union bound over all t ∈ [T ] and h ∈ [H] to get the desired result.
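Lemma 7 (Value function estimation). Suppose that $\mathcal{T}(\omega_{t,h} + \theta_{t,h+1}) \in \mathcal{O}(W_h)$. Then,
$$\sum_{i=1}^{t-1} \left(\langle \hat\theta_{t,h}, \phi(s_{i,h}, a_{i,h})\rangle - V_{t,h+1}(s_{i,h+1})\right)^2 \le \varepsilon_1^2 + T\varepsilon_B^2.$$
Furthermore, $\|\hat\theta_{t,h} - \mathcal{T}(\omega_{t,h} + \theta_{t,h+1})\|_{\hat\Sigma_{t,h}} \le \sqrt{2\varepsilon_1^2 + 4T\varepsilon_B^2} =: B^P_{\mathrm{err}}$.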
Proof of Lemma 7. The Bayes optimal T (ω t,h + θ t,h+1 ) should achieve the empirical risk of at most ε B , i.e.,
$$\forall i \in [t-1] : \quad \big|\langle \phi(s_{i,h}, a_{i,h}), \mathcal{T}(\omega_{t,h} + \theta_{t,h+1})\rangle - V_{t,h+1}(s_{i,h+1})\big| \le \varepsilon_B.$$
Since T (ω t,h + θ t,h+1 ) is realizable (i.e., T (ω t,h + θ t,h+1 ) ∈ O(W h )), and θt,h minimizes the objective up to precision ε 1 , it should satisfy the following
t-1 ∑ i=1 (⟨ θt,h , ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 ≤ ε 2 1 + T ε 2 B .
Combining the above two results, we arrive at the following:
t-1 ∑ i=1 ⟨ϕ(s i,h , a i,h ), θt,h -T (ω t,h + θ t,h+1 )⟩ 2 ≤ 2 t-1 ∑ i=1 (⟨ϕ(s i,h , a i,h ), θt,h ⟩ -V t,h+1 (s i,h+1 )) 2 + 2 t-1 ∑ i=1 (V t,h+1 (s i,h+1 ) -⟨ϕ(s i,h , a i,h ), T (ω t,h + θ t,h+1 )⟩) 2 (using (a + b) 2 ≤ 2a 2 + 2b 2 ) ≤ 2ε 2 1 + 4T ε 2 B . This implies that ∥ θt,h -T (ω t,h + θ t,h+1 )∥ 2 Σt,h ≤ 2ε 2 1 + 4T ε 2 B .
Definition 5 (High-probability events). Define event E high as
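$$\begin{aligned} \mathcal{E}^{\mathrm{high}} := {} & \Big\{\forall t \in [T], \forall h \in [H] : \|\xi^P_{t,h}\|_{\Lambda_{t,h}} \le \sigma_h\sqrt{2d\log(6dH^2T^2)} =: B^P_{\mathrm{noise},h}\Big\} \\ & \cap \Big\{\forall t \in [T], \forall h \in [H] : \|\xi^R_{t,h}\|_{\Sigma_{t,h}} \le \sigma_R\sqrt{2d\log(6dHT^2)} =: B^R_{\mathrm{noise}}\Big\} \\ & \cap \Big\{\forall t \in [T], \forall h \in [H] : \|\eta^R_{t,h}\|_{\Sigma_{t,h}} \le B^R_{\mathrm{err}}\Big\}, \end{aligned}$$
where $B^R_{\mathrm{err}} := \sqrt{1030(1+\varepsilon_2)^4 d\log\big(24(1+\varepsilon_2)e^2T^3H^2\big) + 4\varepsilon_1^2 + 16(1+\varepsilon_2)(1+\varepsilon_B T)}$.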
Lemma 8. We have Pr(E high ) > 1 -1/(HT ).
Proof of Lemma 8. Below we show that each event defined in Definition 5 holds with probability at least 1 -1/(3HT ). Then, by union bound, we have the desired result.
Proof of the first event. The way we generate $\xi^P_{t,h}$ is equivalent to first sampling $\zeta_{t,h} \sim \mathcal{N}(0, (\sigma_h)^2\Lambda_{t,h}^{-1})$ and then setting $\xi^P_{t,h} \leftarrow (I - P_{t,h})\zeta_{t,h}$. By Lemma 20 and the union bound, we have
$$\Pr\Big(\exists t \in [T], \exists h \in [H] : \|\zeta_{t,h}\|_{\Lambda_{t,h}} > \sigma_h\sqrt{2d\log(6dH^2T^2)}\Big) \le 1/(3HT).$$
Then, by definition, we have
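$$\begin{aligned} \|\xi^P_{t,h}\|^2_{\Lambda_{t,h}} = \|(I - P_{t,h})\zeta_{t,h}\|^2_{\Lambda_{t,h}} &= \zeta_{t,h}^\top(I - P_{t,h})\sum_{i=1}^{m}\Big(\phi^{\parallel}_{t,h,i}(\phi^{\parallel}_{t,h,i})^\top + \phi^{\perp}_{t,h,i}(\phi^{\perp}_{t,h,i})^\top\Big)(I - P_{t,h})\zeta_{t,h} \\ &= \zeta_{t,h}^\top\sum_{i=1}^{m}\phi^{\perp}_{t,h,i}(\phi^{\perp}_{t,h,i})^\top\zeta_{t,h} \\ &\le \zeta_{t,h}^\top\sum_{i=1}^{m}\Big(\phi^{\parallel}_{t,h,i}(\phi^{\parallel}_{t,h,i})^\top + \phi^{\perp}_{t,h,i}(\phi^{\perp}_{t,h,i})^\top\Big)\zeta_{t,h}, \end{aligned}$$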
where the third step holds by the fact that ϕ ⊥ is in the null space and ϕ ∥ is in the span. Hence, we conclude that ∥ξ P t,h ∥ Λ t,h ≤ ∥ζ t,h ∥ Λ t,h . Proof of the second event. Applying Lemma 20 and the union bound, we have
$$\Pr\Big(\exists t \in [T] : \|\xi^R_t\|_{\Sigma_t} > \sigma_R\sqrt{2d\log(6dHT^2)}\Big) \le 1/(3HT).$$
Proof of the third event. This is directly from Lemma 6.
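Lemma 9 (Boundedness of parameters). Under Assumption 4, conditioning on $\mathcal{E}^{\mathrm{high}}$, the following hold for all $t \in [T]$ and $h \in [H]$:
1. $\max_{s,a} |\langle \phi(s,a), \hat\theta_{t,h}\rangle| \le W_h + \varepsilon_2$;
2. $\max_{s,a} |\langle \phi(s,a), \mathcal{T}(\omega_{t,h} + \theta_{t,h+1})\rangle| \le W_h$;
3. $\|\eta_{t,h}\|_{\hat\Sigma_{t,h}} \le B^P_{\mathrm{err}}$;
4. $\|\eta_{t,h}\|_{\Lambda} \le 2(W_h + \varepsilon_2)\sqrt{m}$;
5. $\|\eta_{t,h}\|_{\Lambda_{t,h}} \le \sqrt{3}\gamma B^P_{\mathrm{err}} + \sqrt{8m}(W_h + \varepsilon_2)$;
6. $\max_{s,a} |\langle \phi(s,a), \theta_{t,h}\rangle| \le W_{h-1} - \sqrt{2d} \cdot B^R_{\mathrm{noise}} - 1 - \varepsilon_2$;
7. $\max_s V_{t,h}(s) = \max_{s,a} |Q_{t,h}(s,a)| \le W_{h-1}$.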
Proof of Lemma 9. Fix t ∈ [T ]. We prove these items by induction on h. The base case (h = H +1) clearly holds since there is actually nothing at (H + 1)-th step. Now assume they hold for h + 1, and we will show that they hold for h as well.
Proof of Item 1. It is simply by Line 9 of Algorithm 1 and Assumption 3.
Proof of Item 2. By linear Bellman completeness (Definition 1), for any s, a, we have,
|⟨ϕ(s, a), T (ω t,h + θ t,h+1 )⟩| = | E s ′ ∼T(s,a) max a ′ ⟨ϕ(s ′ , a ′ ), ω t,h + θ t,h+1 ⟩| ≤ max s,a |⟨ϕ(s, a), ω t,h + θ t,h+1 ⟩| ≤ max s,a |⟨ϕ(s, a), ωt,h ⟩| + max s,a |⟨ϕ(s, a), ξ R t,h ⟩| + max s,a |⟨ϕ(s, a), θ t,h+1 ⟩| ≤ (1 + ε 2 ) + max s,a ∥ϕ(s, a)∥ Σ -1 t,h ∥ξ R t,h ∥ Σ t,h + (W h - √ 2d ⋅ B R noise -1 -ε 2 ) ≤ 1 + ε 2 + √ 2d ⋅ B R noise + (W h - √ 2d ⋅ B R noise -1 -ε 2 ) = W h .
Proof of Item 3. This is directly from Lemma 7.
Proof of Item 4. By triangle inequality, we have
∥η t,h ∥ Λ = ∥ θt,h -T (ω t,h + θ t,h+1 )∥ Λ ≤ ∥ θt,h ∥ Λ + ∥T (ω t,h + θ t,h+1 )∥ Λ ≤ 2(W h + ε 2 ) √ m.
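where the last step is by
$$\|\hat\theta_{t,h}\|_{\Lambda} = \sqrt{\sum_{i=1}^{m} \rho_i\langle \phi_i, \hat\theta_{t,h}\rangle^2} \le \sqrt{\sum_{i=1}^{m} (W_h + \varepsilon_2)^2} = (W_h + \varepsilon_2)\sqrt{m}$$
and similarly for $\mathcal{T}(\omega_{t,h} + \theta_{t,h+1})$.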
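Proof of Item 5. By definition, we have
$$\begin{aligned} \|\eta_{t,h}\|^2_{\Lambda_{t,h}} &= \sum_{i=1}^{m} \rho_i\Big(\langle \phi^{\parallel}_{t,h,i}, \eta_{t,h}\rangle^2 + \langle \phi^{\perp}_{t,h,i}, \eta_{t,h}\rangle^2\Big) \\ &= \sum_{i=1}^{m} \rho_i\Big(\langle P_{t,h}\phi_i, \eta_{t,h}\rangle^2 + \langle (I - P_{t,h})\phi_i, \eta_{t,h}\rangle^2\Big) \\ &\le \sum_{i=1}^{m} \rho_i\Big(3\langle P_{t,h}\phi_i, \eta_{t,h}\rangle^2 + 2\langle \phi_i, \eta_{t,h}\rangle^2\Big) && \text{(using } (a-b)^2 \le 2a^2 + 2b^2\text{)} \\ &\le 3\sum_{i=1}^{m} \rho_i\Big(\|\phi^{\parallel}_{t,h,i}\|^2_{\hat\Sigma^\dagger_{t,h}}\|\eta_{t,h}\|^2_{\hat\Sigma_{t,h}}\Big) + 2\|\eta_{t,h}\|^2_{\Lambda} && \text{(Cauchy-Schwarz, Lemma 25)} \end{aligned}$$
We have $\|\phi^{\parallel}_{t,h,i}\|_{\hat\Sigma^\dagger_{t,h}} = \|P_{t,h}\phi_i\|_{\hat\Sigma^\dagger_{t,h}} = \|\phi_i\|_{\hat\Sigma^\dagger_{t,h}} \le \gamma$ by Assumption 4. Combining this with Item 3, we get
$$\|\eta_{t,h}\|^2_{\Lambda_{t,h}} \le 3\gamma^2(B^P_{\mathrm{err}})^2 + 2\|\eta_{t,h}\|^2_{\Lambda} \le 3\gamma^2(B^P_{\mathrm{err}})^2 + 8(W_h + \varepsilon_2)^2 m. \qquad \text{(Item 4)}$$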
Proof of Item 6. We have
max s,a |⟨ϕ(s, a), θ t,h ⟩| = max s,a |⟨ϕ(s, a), θt,h + ξ P t,h ⟩| ≤ max s,a |⟨ϕ(s, a), θt,h ⟩| + max s,a |⟨ϕ(s, a), ξ P t,h ⟩| ≤ W h + ε 2 + max s,a ∥ϕ(s, a)∥ Λ -1 t,h ∥ξ P t,h ∥ Λ t,h ≤ W h + ε 2 + √ 2d ⋅ B P noise,h (Lemma 5) = W h-1 - √ 2d ⋅ B R noise -1 -ε 2 .
Proof of Item 7. We have
|Q t,h (s, a)| = |⟨ϕ(s, a), θ t,h ⟩ + ⟨ϕ(s, a), ω t,h ⟩| ≤ |⟨ϕ(s, a), θ t,h ⟩| + |⟨ϕ(s, a), ωt,h ⟩| + |⟨ϕ(s, a), ξ R t ⟩| ≤ (W h-1 - √ 2d ⋅ B R noise -1 -ε 2 ) + (1 + ε 2 ) + √ 2d ⋅ B R noise = W h-1 .
and also, max
s |V t,h (s)| = max s,a |Q t,h (s, a)| ≤ W h-1 .

Section: C.2 VALUE DECOMPOSITION
We note that, at any round t ∈ [T ], conditioning on all information collected up to round t -1, the randomness of V t only comes from the Gaussian noise {ξ P t,h , ξ R t,h } H h=1 . In other words, V t can be considered a functional of the Gaussian noise. In light of this, we define 
V t,
s.t. ∀h ∈ [H] ∶ ∥ ξP t,h ∥ Λ t,h ≤ B P noise,h , ∥ ξR t,h ∥ Σ t,h ≤ B R noise .
In other words, U t achieves the minimum value at s t,1 while satisfying the high-probability constraints (E high ) on the noise variable. We denote ξ P 1 , . . . , ξ P H , ξ R 1 , . . . , ξ R H as the minimizer of the above program, and will always use underlined variables to represent the intermediate variables corresponding to U t (such as θ, θ, ω, ω, Q, V ) to distinguish them from the variables corresponding to V t , ( θ, θ, ω, ω, Q, V ). We note that, under E high , we directly have U t (s t,1 ) ≤ V t (s t,1 ).
Below is a decomposition lemma under deterministic transition. Note that it slightly differs from the usual value decomposition lemma under stochastic transitions, where we have to take the expectation over trajectory randomness. This distinction is crucial to our analysis: by not accounting for trajectory randomness, we can effectively leverage our span argument.
We denote {s t,h , a t,h } H h=1 as the trajectory generated by executing π t with initial state s t,1 , and {s ⋆ t,h , a ⋆ t,h } H h=1 as the trajectory generated by executing π ⋆ with initial state s ⋆ t,1 = s t,1 . Lemma 10 (Value decomposition under deterministic transition). Under deterministic transition (Assumption 1), we have
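$$V^{\pi_t}(s_{t,1}) - V_t(s_{t,1}) = \sum_{h=1}^{H}\Big(V_{t,h+1}(s_{t,h+1}) - \langle \theta_{t,h}, \phi(s_{t,h}, a_{t,h})\rangle + \langle \omega^\star_h - \omega_{t,h}, \phi(s_{t,h}, a_{t,h})\rangle\Big); \tag{10}$$
$$V^\star(s_{t,1}) - V_t(s_{t,1}) \le \sum_{h=1}^{H}\Big(V_{t,h+1}(s^\star_{t,h+1}) - \langle \theta_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h})\rangle + \langle \omega^\star_h - \omega_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h})\rangle\Big). \tag{11}$$
Similarly, we have
$$V^{\pi_t}(s_{t,1}) - U_t(s_{t,1}) \le \sum_{h=1}^{H}\Big(U_{t,h+1}(s_{t,h+1}) - \langle \underline{\theta}_{t,h}, \phi(s_{t,h}, a_{t,h})\rangle + \langle \omega^\star_h - \underline{\omega}_{t,h}, \phi(s_{t,h}, a_{t,h})\rangle\Big). \tag{12}$$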
Proof of Lemma 10. We will prove (10) and (11) together, and then prove (12).
Proof of (10) and (11). We consider an arbitrary policy π. Let {s ′ t,h , a ′ t,h } H h=1 denote the deterministic trajectory generated by π with initial state s ′ t,1 = s t,1 . By definition, we have
V π (s ′ t,1 ) -V t (s ′ t,1 ) = Q π 1 (s ′ t,1 , π(s ′ t,1 )) -max a Q t,1 (s ′ t,1 , a) ≤ Q π 1 (s ′ t,1 , π(s ′ t,1 )) -Q t,1 (s ′ t,1 , π(s ′ t,1 )) (13) = V π 2 (s ′ t,2 ) + r h (s ′ t,1 , a ′ t,1 ) -⟨θ t,1 , ϕ(s ′ t,1 , π(s ′ t,1 ))⟩ -⟨ω t,h , ϕ(s ′ t,1 , π(s ′ t,1 ))⟩ (by definition) = (V π 2 (s ′ t,2 ) -V t,2 (s ′ t,2 )) + (V t,2 (s ′ t,2 ) -⟨θ t,1 , ϕ(s ′ t,1 , π(s ′ t,1 ))⟩) + ⟨ω ⋆ h -ω t,h , ϕ(s ′ t,1 , a ′ t,1 )⟩
Recursively expanding the first term, we obtain
V π (s ′ t,1 ) -V t (s ′ t,1 ) ≤ H ∑ h=1 (V t,h+1 (s ′ t,h+1 ) -⟨θ t,h , ϕ(s ′ t,h , a ′ t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s ′ t,h , a ′ t,h )⟩).
This proves (11) by specifying $\pi = \pi^\star$. Similarly, (10) can be proved by observing that the only inequality, (13), becomes an equality when $\pi = \pi_t$.
Proof of (12). The proof is quite similar. We have
V πt (s t,1 ) -U t (s t,1 ) = Q πt 1 (s t,1 , π t (s t,1 )) -max a Q t,1 (s t,1 , a) ≤ Q πt 1 (s t,1 , π t (s t,1 )) -Q t,1 (s t,1 , π t (s t,1 )) = V πt 2 (s t,2 ) + r h (s t,1 , a t,1 ) -⟨θ t,1 , ϕ(s t,1 , π t (s t,1 ))⟩ -⟨ω t,h , ϕ(s t,1 , a t,1 )⟩ (by definition) = (V πt 2 (s t,2 ) -U t,2 (s t,2 )) + (U t,2 (s t,2 ) -⟨θ t,1 , ϕ(s t,1 , π t (s t,1 ))⟩) + ⟨ω ⋆ h -ω t,h , ϕ(s t,1 , a t,1 )⟩
Recursively expanding the first term, we obtain
V πt (s t,1 ) -U t (s t,1 ) ≤ H ∑ h=1 (U t,h+1 (s t,h+1 ) -⟨θ t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s t,h , a t,h )⟩).
This completes the proof.
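Lemma 11. For any $t \in [T]$, conditioning on $\mathcal{E}^{\mathrm{span}}_t$, we have the following (in)equalities:
$$V_t(s_{t,1}) = \sum_{h=1}^{H}\Big(\langle \hat\theta_{t,h} - \mathcal{T}(\theta_{t,h+1} + \omega_{t,h+1}), \phi(s_{t,h}, a_{t,h})\rangle + \langle \omega_{t,h}, \phi(s_{t,h}, a_{t,h})\rangle\Big),$$
$$U_t(s_{t,1}) \ge \sum_{h=1}^{H}\Big(\langle \underline{\omega}_{t,h}, \phi(s_{t,h}, a_{t,h})\rangle + \langle \underline{\hat\theta}_{t,h} - \mathcal{T}(\underline{\theta}_{t,h+1} + \underline{\omega}_{t,h+1}), \phi(s_{t,h}, a_{t,h})\rangle\Big).$$
Proof of Lemma 11. We will prove the two statements separately, but the proofs are quite similar.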
Proof of the first statement. By Lemma 10, we have
V t (s t,1 ) -V πt (s t,1 ) = H ∑ h=1 (⟨ θt,h , ϕ(s t,h , a t,h )⟩ + ⟨ξ P t,h , ϕ(s t,h , a t,h )⟩ -V t,h+1 (s t,h+1 ) + ⟨ω t,h -ω ⋆ h , ϕ(s t,h , a t,h )⟩)
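By linear Bellman completeness (Definition 1), there exists a vector, denoted by $\mathcal{T}(\theta_{t,h+1} + \omega_{t,h+1})$, such that $V_{t,h+1}(s') = \langle \phi(s, a), \mathcal{T}(\theta_{t,h+1} + \omega_{t,h+1}) \rangle$ for every state-action pair $(s, a)$ with next state $s'$. Hence, we can rewrite the above as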
V t (s t,1 ) -V πt (s t,1 ) = H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩ + ⟨ξ P t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω t,h -ω ⋆ h , ϕ(s t,h , a t,h )⟩).
Note that, by the definition of $V^{\pi_t}$, we have $V^{\pi_t}(s_{t,1}) = \sum_{h=1}^{H} \langle \omega^\star_h, \phi(s_{t,h}, a_{t,h}) \rangle$. Hence, the above implies
V t (s t,1 ) = H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ) + ξ P t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω t,h , ϕ(s t,h , a t,h )⟩).
We can remove ξ P t,h since ⟨ξ P t,h , ϕ(s t,h , a t,h )⟩ = 0 conditioning on E span t .
Proof of the second statement. By Lemma 10, we have
V πt (s t,1 ) -U t (s t,1 ) ≤ H ∑ h=1 (U t,h+1 (s t,h+1 ) -⟨θ t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s t,h , a t,h )⟩)
Again, by the definition of V πt , we conclude that
U t (s t,1 ) ≥ H ∑ h=1 (⟨ω t,h , ϕ(s t,h , a t,h )⟩ + ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ) + ξ P t,h , ϕ(s t,h , a t,h )⟩).
We can remove ξ P t,h since ⟨ξ P t,h , ϕ(s t,h , a t,h )⟩ = 0 conditioning on E span t .
The following lemma shows that, conditioning on the span event $E^{\mathrm{span}}_t$, the value function $V_t$ cannot deviate too much from the value function $V^{\pi_t}$ on average.
Lemma 12. For any $t \in [T]$, under Assumption 4 and conditioning on $E^{\mathrm{span}}_t$ and $E^{\mathrm{high}}$, we have
$$
\sum_{t=1}^{T} \big( V_t(s_{t,1}) - V^{\pi_t}(s_{t,1}) \big) \le B^{P}_{\mathrm{err}} \gamma H + (B^{R}_{\mathrm{noise}} + B^{R}_{\mathrm{err}}) \cdot B^{R}_{\phi}.
$$
Proof of Lemma 12. We apply Lemma 11 to decompose V t and obtain
T ∑ t=1 (V t (s t,1 ) -V πt (s t,1 )) = T ∑ t=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩ + ⟨ω t,h -ω ⋆ h , ϕ(s t,h , a t,h )⟩)
Applying Cauchy-Schwarz yields
≤ T ∑ t=1 (∥ θt,h -T (θ t,h+1 + ω t,h+1 )∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h + ∥ω t,h -ω ⋆ h ∥ Σ t,h , ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h
)
We apply Lemma 7 and Assumption 4 to the left term and Lemmas 6 and 16 and Definition 5 to the right. Then, we obtain
≤ H ⋅ B P err γ + (B R noise + B R err ) ⋅ B R ϕ .
This completes the proof.
The following lemma establishes upper bounds on the value functions when conditioning on the span event E span t .
Lemma 13. For any t ∈ [T ], conditioning on E span t and E high , we have
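$$
|U_t(s_{t,1})| \le H \cdot (B^{R}_{\mathrm{noise}} + B^{R}_{\mathrm{err}}) \cdot \sqrt{d} + H \cdot (1 + B^{P}_{\mathrm{err}} \gamma).
$$
Moreover, we have
$$
|V_t(s_{t,1})| \le H \cdot (B^{R}_{\mathrm{noise}} + B^{R}_{\mathrm{err}}) \cdot \sqrt{d} + H \cdot (1 + B^{P}_{\mathrm{err}} \gamma).
$$
We abbreviate
$$
B_V := H \cdot (B^{R}_{\mathrm{noise}} + B^{R}_{\mathrm{err}}) \cdot \sqrt{d} + H \cdot (1 + B^{P}_{\mathrm{err}} \gamma).
$$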
Proof of Lemma 13. We will first prove the second statement and then the first statement.
Proof of the second statement. Applying Lemma 11 and the triangle inequality, we have the following
|V t (s t,1 )| ≤ | H ∑ h=1 ⟨ω t,h , ϕ(s t,h , a t,h )⟩| + | H ∑ h=1 ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩| =∶ T 1 + T 2 .
We bound the two terms separately. For T 1 , we have
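$$
\begin{aligned}
T_1 &= \Big| \sum_{h=1}^{H} \langle (\omega_{t,h} - \widehat\omega_{t,h}) + (\widehat\omega_{t,h} - \omega^\star_h) + \omega^\star_h, \phi(s_{t,h}, a_{t,h}) \rangle \Big| \\
&\le \sum_{h=1}^{H} \big( \|\omega_{t,h} - \widehat\omega_{t,h}\|_{\Sigma_{t,h}} + \|\widehat\omega_{t,h} - \omega^\star_h\|_{\Sigma_{t,h}} \big) \|\phi(s_{t,h}, a_{t,h})\|_{\Sigma_{t,h}^{-1}} + V^{\pi_t}(s_{t,1}) && \text{(Cauchy-Schwarz)} \\
&\le H \cdot (B^{R}_{\mathrm{noise}} + B^{R}_{\mathrm{err}}) \cdot \sqrt{d} + H. && \text{(Definition 5 and Lemma 5)}
\end{aligned}
$$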
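For $T_2$, we can use Cauchy-Schwarz:
$$
\begin{aligned}
T_2 &= \Big| \sum_{h=1}^{H} \langle \widehat\theta_{t,h} - \mathcal{T}(\theta_{t,h+1} + \omega_{t,h+1}), \phi(s_{t,h}, a_{t,h}) \rangle \Big| \\
&\le \sum_{h=1}^{H} \|\widehat\theta_{t,h} - \mathcal{T}(\theta_{t,h+1} + \omega_{t,h+1})\|_{\widehat\Sigma_{t,h}} \|\phi(s_{t,h}, a_{t,h})\|_{\widehat\Sigma^{\dagger}_{t,h}} && \text{(Cauchy-Schwarz, Lemma 25)} \\
&\le B^{P}_{\mathrm{err}} \gamma H. && \text{(Assumption 4 and Lemma 7)}
\end{aligned}
$$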
Proof of the first statement. We prove it by establishing a lower bound and an upper bound of U t (s t,1 ) separately. We start with the lower bound, whose derivation is similar to the second statement we just proved above:
U t (s t,1 ) ≥ H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩ + ⟨ω t,h , ϕ(s t,h , a t,h )⟩) (Lemma 11) ≥ -B P err γH -| H ∑ h=1 ⟨(ω t,h -ωt,h ) + (ω t,h -ω ⋆ h ) + ω ⋆ h , ϕ(s t,h , a t,h )⟩|
(following a similar argument as above)
≥ -B P err γH - H ∑ h=1 (∥ω t,h -ωt,h ∥ Σ t,h + ∥ω t,h -ω ⋆ h ∥ Σ t,h ) ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h (Cauchy-Schwartz) ≥ -B P err γH -H ⋅ (B R noise + B R err ) ⋅ √ d. (Lemma 8)
The upper bound of U t (s t,1 ) is a consequence of the second statement we just proved above:
U t (s t,1 ) ≤ E[V t (s t,1 ) | E high ]
(by definition)
≤ B P err γH + H ⋅ (B R noise + B R err ) ⋅ √ d + H.
We finish the proof by combining the lower and upper bounds.

Section: C.3 EXPLORATION IN THE NULL SPACE
Lemma 14 (optimism with constant probability). For any t ∈ [T ], denote E optm t as the event that V ⋆ (s t,1 ) ≤ V t (s t,1 ) + B P err γH. Then, under Assumption 4 and conditioning on the high-probability event E high , we have
$$
\Pr(E^{\mathrm{optm}}_t) \ge \Gamma^2(-1),
$$
where Γ(⋅) is the CDF of the standard normal distribution.
Proof of Lemma 14. By Lemma 10, we have:
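$$
\begin{aligned}
V^\star(s_{t,1}) - V_t(s_{t,1})
&\le \sum_{h=1}^{H} \Big( V_{t,h+1}(s^\star_{t,h+1}) - \langle \theta_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h}) \rangle + \langle \omega^\star_h - \omega_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h}) \rangle \Big) \\
&= \underbrace{\sum_{h=1}^{H} \Big( V_{t,h+1}(s^\star_{t,h+1}) - \langle \widehat\theta_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h}) \rangle \Big)}_{\text{(i)}}
 - \underbrace{\sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h}) \rangle}_{\text{(ii)}} \\
&\quad + \underbrace{\sum_{h=1}^{H} \langle \omega^\star_h - \widehat\omega_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h}) \rangle}_{\text{(iii)}}
 - \underbrace{\sum_{h=1}^{H} \langle \xi^{R}_{t,h}, \phi(s^\star_{t,h}, a^\star_{t,h}) \rangle}_{\text{(iv)}}.
\end{aligned}
$$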
Note that, given any state-action-state triple (s, a, s ′ ), we have
V t,h+1 (s ′ ) -⟨ θt,h , ϕ(s, a)⟩ = ⟨T (ω t,h+1 + θ t,h+1 ) -θt,h , ϕ(s, a)⟩ = ⟨η t,h , ϕ(s, a)⟩.
Plugging this back to (i), we obtain
(i) -(ii) ≤ H ∑ h=1 ⟨η t,h -ξ P t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ =∶ H ∑ h=1 ⟨η t,h -ξ P t,h , ϕ ⋆ h ⟩
where we abbreviate ϕ ⋆ h ∶= ϕ(s ⋆ t,h , a ⋆ t,h ). Next, we split it into two parts:
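$$
\begin{aligned}
\text{(i)} - \text{(ii)}
&\le \sum_{h=1}^{H} \langle \eta_{t,h}, P_{t,h} \phi^\star_h \rangle + \sum_{h=1}^{H} \langle \eta_{t,h}, (I - P_{t,h}) \phi^\star_h \rangle - \sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi^\star_h \rangle \\
&\le \sum_{h=1}^{H} \|\eta_{t,h}\|_{\widehat\Sigma_{t,h}} \|P_{t,h} \phi^\star_h\|_{\widehat\Sigma^{\dagger}_{t,h}} + \sum_{h=1}^{H} \|\eta_{t,h}\|_{\Lambda_{t,h}} \|(I - P_{t,h}) \phi^\star_h\|_{\Lambda_{t,h}^{-1}} - \sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi^\star_h \rangle && \text{(Cauchy-Schwarz, Lemma 25)} \\
&\le B^{P}_{\mathrm{err}} \gamma H + \sum_{h=1}^{H} \|\eta_{t,h}\|_{\Lambda_{t,h}} \|(I - P_{t,h}) \phi^\star_h\|_{\Lambda_{t,h}^{-1}} - \sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi^\star_h \rangle && \text{(Assumption 4 and Lemmas 7 and 26)} \\
&\le B^{P}_{\mathrm{err}} \gamma H + \sqrt{H \sum_{h=1}^{H} \|\eta_{t,h}\|^2_{\Lambda_{t,h}} \|(I - P_{t,h}) \phi^\star_h\|^2_{\Lambda_{t,h}^{-1}}} - \sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi^\star_h \rangle. && \text{(Cauchy-Schwarz)}
\end{aligned}
$$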
Recall that ξ P t,h is sampled from N (0, σ 2 h (I -P t,h )Λ -1 t,h (I -P t,h )). Therefore,
$$
\sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi^\star_h \rangle \sim \mathcal{N}\Big( 0, \sum_{h=1}^{H} \sigma_h^2 \|(I - P_{t,h}) \phi^\star_h\|^2_{\Lambda_{t,h}^{-1}} \Big).
$$
Since σ h ≥ √ H∥η t,h ∥ Λ t,h under high-probability event E high , we have
Pr ((i) -(ii) ≤ B P err γH) ≥ Γ(-1).
Next, we consider (iii) -(iv). By a similar argument, we have
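$$
\begin{aligned}
\text{(iii)} - \text{(iv)}
&= \sum_{h=1}^{H} \langle \omega^\star_h - \widehat\omega_{t,h}, \phi^\star_h \rangle - \sum_{h=1}^{H} \langle \xi^{R}_{t,h}, \phi^\star_h \rangle \\
&\le \sum_{h=1}^{H} \|\omega^\star_h - \widehat\omega_{t,h}\|_{\Sigma_{t,h}} \|\phi^\star_h\|_{\Sigma_{t,h}^{-1}} - \sum_{h=1}^{H} \langle \xi^{R}_{t,h}, \phi^\star_h \rangle \\
&\le \sqrt{H \cdot \sum_{h=1}^{H} \|\omega^\star_h - \widehat\omega_{t,h}\|^2_{\Sigma_{t,h}} \|\phi^\star_h\|^2_{\Sigma_{t,h}^{-1}}} - \sum_{h=1}^{H} \langle \xi^{R}_{t,h}, \phi^\star_h \rangle.
\end{aligned}
$$
Recall that $\xi^{R}_{t,h}$ is sampled from $\mathcal{N}(0, \sigma_R^2 \Sigma_{t,h}^{-1})$, and thus, we have
$$
\sum_{h=1}^{H} \langle \xi^{R}_{t,h}, \phi^\star_h \rangle \sim \mathcal{N}\Big( 0, \sum_{h=1}^{H} \sigma_R^2 \|\phi^\star_h\|^2_{\Sigma_{t,h}^{-1}} \Big).
$$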
Therefore, since σ R ≥ √ H∥ω ⋆ h -ωt,h ∥ Σt (Lemma 9), we have Pr ((iii) -(iv) ≤ 0) ≥ Γ(-1).
Since the two events are independent, the probability that both events happen is at least Γ 2 (-1).
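As a numerical sanity check of the constant in Lemma 14, the following sketch evaluates Γ(−1) and Γ²(−1) using the standard-library error function. The helper names are illustrative and do not come from the paper. The second helper illustrates the step used twice in the proof above: if a deterministic bias is at most the standard deviation of a centered Gaussian, then the Gaussian exceeds the bias with probability at least Γ(−1).

```python
import math

def std_normal_cdf(z):
    """CDF of the standard normal distribution (written Gamma in the text)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_noise_dominates(bias, noise_std):
    """Pr(bias - X <= 0) for X ~ N(0, noise_std^2), i.e. Pr(X >= bias)."""
    return 1.0 - std_normal_cdf(bias / noise_std)

gamma_minus_one = std_normal_cdf(-1.0)   # probability of one of the two events
both_events = gamma_minus_one ** 2       # two independent events, as in Lemma 14

print(f"Gamma(-1)   = {gamma_minus_one:.6f}")
print(f"Gamma(-1)^2 = {both_events:.6f}")
# The probability only improves when the bias is smaller than the noise scale.
print(prob_noise_dominates(bias=0.5, noise_std=1.0) >= gamma_minus_one)
```

The optimism probability is therefore a small constant, roughly 2.5%, which is why the regret analysis later pays a factor of 1/Γ²(−1).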
Published as a conference paper at ICLR 2025

Lemma 15. The number of rounds in which $E^{\mathrm{span}}_t$ does not hold is at most $dH$, i.e.,
$$
\sum_{t=1}^{T} \mathbb{1}\{(E^{\mathrm{span}}_t)^{\complement}\} \le dH.
$$
Proof. By definition, when E span t does not hold, there exists h ∈ [H] such that ϕ(s t,h , a t,h ) is not in the span of {ϕ(s i,h , a i,h )} t-1 i=1 . That means, the dimension of the span should increase by exactly one after this iteration, i.e., dim (span ({ϕ(s i,h , a i,h )} t i=1 )) = dim (span ({ϕ(s i,h , a i,h )} t-1 i=1 )) + 1. However, the dimension cannot exceed d, so it can only increase at most d times. This argument holds for any h ∈ [H], and thus, the total number of times E span t does not happen will not exceed dH.
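The counting argument can be illustrated for a single step h with a short simulation. This is a sketch under assumed synthetic features (not data from the paper): it tracks an orthonormal basis of the span seen so far and counts the rounds whose feature vector falls outside that span. The function name and tolerance are illustrative choices.

```python
import numpy as np

def count_span_violations(features, tol=1e-9):
    """Count rounds whose feature vector lies outside the span of all earlier ones."""
    d = features.shape[1]
    basis = np.zeros((0, d))          # rows form an orthonormal basis of the span so far
    violations = 0
    for phi in features:
        # Residual of phi after projecting onto the current span.
        residual = phi - basis.T @ (basis @ phi)
        if np.linalg.norm(residual) > tol * max(1.0, np.linalg.norm(phi)):
            violations += 1           # the span gains exactly one dimension
            basis = np.vstack([basis, residual / np.linalg.norm(residual)])
    return violations

rng = np.random.default_rng(0)
d, T = 4, 200
# Hypothetical features: confined to a 2-dimensional subspace for 50 rounds, then generic.
low_rank = rng.normal(size=(50, 2)) @ rng.normal(size=(2, d))
generic = rng.normal(size=(T - 50, d))
features = np.vstack([low_rank, generic])

violations = count_span_violations(features)
print(violations, "<=", d)
```

Repeating the same count for each of the H steps gives the dH bound of the lemma.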
Lemma 16. For any h ∈ [H], it holds that
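$$
\sum_{t=1}^{T} \|\phi(s_{t,h}, a_{t,h})\|_{\Sigma_{t,h}^{-1}} \le d \sqrt{2T \log(T + 1)} =: B^{R}_{\phi},
$$
$$
\sum_{t=1}^{T} \mathbb{1}\{E^{\mathrm{span}}_t\} \|\phi(s_{t,h}, a_{t,h})\|_{\widehat\Sigma^{\dagger}_{t,h}} \le \gamma d \sqrt{2dT \log(2T\gamma^2)} =: B^{P}_{\phi}.
$$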
Proof of Lemma 16. We prove the two inequalities separately.
Proof of the first inequality. For any t ∈ [T ] and h ∈ [H], we have the following bound on the norm of features (Lemma 5):
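$$
\|\phi(s_{t,h}, a_{t,h})\|_{\Sigma_{t,h}^{-1}} \le \|\phi(s_{t,h}, a_{t,h})\|_{\Lambda^{-1}} \le \sqrt{d}.
$$
Hence, by Cauchy-Schwarz, we have
$$
\begin{aligned}
\sum_{t=1}^{T} \|\phi(s_{t,h}, a_{t,h})\|_{\Sigma_{t,h}^{-1}}
&\le \sqrt{T \cdot \sum_{t=1}^{T} \|\phi(s_{t,h}, a_{t,h})\|^2_{\Sigma_{t,h}^{-1}}} \\
&= \sqrt{T \cdot \sum_{t=1}^{T} \min\big\{ \|\phi(s_{t,h}, a_{t,h})\|^2_{\Sigma_{t,h}^{-1}}, d \big\}} \\
&\le \sqrt{T d \cdot \sum_{t=1}^{T} \min\big\{ \|\phi(s_{t,h}, a_{t,h})\|^2_{\Sigma_{t,h}^{-1}}, 1 \big\}} \\
&\le \sqrt{T d \cdot 2d \log(T + 1)} && \text{(elliptical potential lemma, Lemma 21)} \\
&= d \sqrt{2T \log(T + 1)}.
\end{aligned}
$$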
Proof of the second inequality. We divide the rounds into $d$ consecutive blocks, in each of which the rank of $\widehat\Sigma_{t,h}$ remains the same. To be specific, let $t_1, t_2, \dots, t_d, t_{d+1}$ be a sequence of integers such that for any $i \in [d]$ and any $t \in \{t_i, t_i + 1, \dots, t_{i+1} - 1\}$, we have $\mathrm{rank}(\widehat\Sigma_{t,h}) = i$.
We will apply the elliptical potential lemma to each block separately. Fix $i \in [d]$ and consider the $i$-th block. Let the reduced eigendecomposition of $\widehat\Sigma_{t_i,h}$ be $\widehat\Sigma_{t_i,h} = U D U^\top$ where $U \in \mathbb{R}^{d \times i}$ and $D \in \mathbb{R}^{i \times i}$. For each $t \in \{t_i, t_i + 1, \dots, t_{i+1} - 1\}$, since $\phi(s_{t,h}, a_{t,h})$ is in the range of $\widehat\Sigma_{t,h}$ conditioning on $E^{\mathrm{span}}_t$, there exists a vector $x_t$ such that $\phi(s_{t,h}, a_{t,h}) = U x_t$.
For any $t \in \{t_i, t_i + 1, \dots, t_{i+1} - 1\}$, we have
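$$
\begin{aligned}
\|\phi(s_{t,h}, a_{t,h})\|^2_{\widehat\Sigma^{\dagger}_{t,h}}
&= \phi(s_{t,h}, a_{t,h})^\top \widehat\Sigma^{\dagger}_{t,h} \phi(s_{t,h}, a_{t,h}) \\
&= \phi(s_{t,h}, a_{t,h})^\top \Big( \widehat\Sigma_{t_i,h} + \sum_{j=t_i}^{t-1} \phi(s_{j,h}, a_{j,h}) \phi(s_{j,h}, a_{j,h})^\top \Big)^{\dagger} \phi(s_{t,h}, a_{t,h}) \\
&= x_t^\top U^\top \Big( U D U^\top + \sum_{j=t_i}^{t-1} U x_j x_j^\top U^\top \Big)^{\dagger} U x_t \\
&= x_t^\top \Big( D + \sum_{j=t_i}^{t-1} x_j x_j^\top \Big)^{-1} x_t.
\end{aligned}
$$
Define $D_t = D + \sum_{j=t_i}^{t-1} x_j x_j^\top$. Hence, we have
$$
\sum_{t=t_i}^{t_{i+1}-1} \mathbb{1}\{E^{\mathrm{span}}_t\} \|\phi(s_{t,h}, a_{t,h})\|_{\widehat\Sigma^{\dagger}_{t,h}} = \sum_{t=t_i}^{t_{i+1}-1} \mathbb{1}\{E^{\mathrm{span}}_t\} \|x_t\|_{D_t^{-1}}.
$$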
By Assumption 4, the eigenvalues of D are lower bounded by 1/γ 2 . And clearly, its eigenvalues are upper bounded by t i ≤ T . Therefore, we have
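$$
\begin{aligned}
\sum_{t=t_i}^{t_{i+1}-1} \mathbb{1}\{E^{\mathrm{span}}_t\} \|x_t\|_{D_t^{-1}}
&\le \sqrt{T \cdot \sum_{t=t_i}^{t_{i+1}-1} \mathbb{1}\{E^{\mathrm{span}}_t\} \|x_t\|^2_{D_t^{-1}}} \\
&= \sqrt{T \cdot \sum_{t=t_i}^{t_{i+1}-1} \mathbb{1}\{E^{\mathrm{span}}_t\} \min\big\{ \|x_t\|^2_{D_t^{-1}}, \gamma^2 \big\}} \\
&\le \gamma \sqrt{T \cdot \sum_{t=t_i}^{t_{i+1}-1} \mathbb{1}\{E^{\mathrm{span}}_t\} \min\big\{ \|x_t\|^2_{D_t^{-1}}, 1 \big\}} \\
&\le \gamma \sqrt{T \cdot 2d \log\big( T\gamma^2 (1 + 1/d) \big)} && \text{(elliptical potential lemma, Lemma 21)} \\
&\le \gamma \sqrt{T \cdot 2d \log(2T\gamma^2)}.
\end{aligned}
$$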
This bounds the sum over a single block. Since there are at most $d$ such blocks, multiplying the above by $d$ completes the proof.

Section: C.4 MAIN STEPS OF THE PROOF
Let Ṽt (s t,1 ) denote an i.i.d. copy of V t conditioned on initial state s t,1 and Ẽoptm t and Ẽhigh denote the counterparts of E optm t and E high but for Ṽt (s t,1 ).
The proof starts with the following decomposition of the regret:
E [ T ∑ t=1 (V ⋆ (s t,1 ) -V πt (s t,1 ))] ≤ E [1{E high } T ∑ t=1 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 ))] + E [1{(E high ) ∁ } T ∑ t=1 (V ⋆ (s t,1 ) -V πt (s t,1 ))] + E [ T ∑ t=1 1{(E span t ) ∁ }(V ⋆ (s t,1 ) -V πt (s t,1 ))]
We will later show that the second and third terms can be bounded separately by observing the following two facts: (1) the probability that E high does not hold is very small, and (2) the number of times E span t does not hold is also small. Hence, it remains to bound the first term, which is the most challenging. Most of the proof below is devoted to bounding it.
As the first step, we will add some necessary event conditions to the first term, using the following lemma. Lemma 17 (Adding necessary event conditions). It holds that
E [1{E high } T ∑ t=1 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 ))] ≤ 1 Γ 2 (-1) E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ E high } Ṽt (s t,1 ) -1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 )]] + 1 Γ 2 (-1) ⋅ (dHB V + B P err γH + (B R noise + B R err ) ⋅ B R ϕ + dH 2 + 1)
where the expectation EṼ t is taken over the randomness of Ṽt (an i.i.d. copy of V t ) only.
Proof of Lemma 17. We have 
E [1{E high } T ∑ t=1 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 ))] ≤ E [1{E high } T ∑ t=1 (V ⋆ (s t,
≤ E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } ( Ṽt (s t,1 ) -1{E span t }V πt (s t,1 )) | Ẽoptm t ]] + E [1{E high } T ∑ t=1 E Ṽt [1{( Ẽhigh ) ∁ } (min{H, Ṽt (s t,1 )} -1{E span t }V πt (s t,1 )) | Ẽoptm t ]] + E [1{E high } T ∑ t=1 E Ṽt [1{( Ẽspan t ) ∁ } (min{H, Ṽt (s t,1 )} -1{E span t }V πt (s t,1 )) | Ẽoptm t ]] + B P err γH =∶ T 1 + T 2 + T 3 + B P
err γH. Below we bound each term separately.
Bounding T 1 . To bound T 1 , we first drop the conditioning event Ẽoptm t to simplify the expression.
To that end, we rearrange it in the following way
T1 = E ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 1{E high } T ∑ t=1 E Ṽt ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ 1{ Ẽhigh ∩ Ẽspan t } ( Ṽt(st,1) -1{E span t }Ut(st,1)) + 1{(E span t ) ∁ } ⋅ B V ( * ) Ẽoptm t ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ + E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t }1{E span t }(Ut(st,1) -V π t (st,1)) | Ẽoptm t ]] -E [1{E high } T ∑ t=1 E Ṽt [1{(E span t ) ∁ } ⋅ B V | Ẽoptm t ]] =∶ T1.1 + T1.2 + T1.3.
We rearrange the terms in this way to make sure that ( * ) is non-negative, so that we can drop the conditioning event Ẽoptm t . Hence, for T 1.1 , we can drop the conditioning event using the following rule (for a non-negative random variable X):
E[X | E] = E[X ⋅ 1{E}]/ Pr(E) ≤ E[X]/ Pr(E)
and using Lemma 14 to get
T1.1 ≤ 1 Γ 2 (-1) E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } ( Ṽt(st,1) -1{E span t }Ut(st,1)) + 1{(E span t ) ∁ } ⋅ B V ]] = 1 Γ 2 (-1) E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } ( Ṽt(st,1) -1{E span t }Ut(st,1))]] + 1 Γ 2 (-1) E [1{E high } T ∑ t=1 1{(E span t ) ∁ } ⋅ B V ] ≤ 1 Γ 2 (-1) E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } ( Ṽt(st,1) -1{E span t }Ut(st,1))]] + 1 Γ 2 (-1) ⋅ dHB V (Lemma 15)
For T 1.2 , we apply Lemma 12 to get
T 1.2 ≤ E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t }1{E span t }(V t (s t,1 ) -V πt (s t,1 )) | Ẽoptm t ]] (V t ≥ U t conditioning on E high ) ≤ B P err γH + (B R noise + B R err ) ⋅ B R ϕ .
We simply upper bound T 1.3 by zero. Plugging all these upper bounds back, we obtain
T 1 ≤ 1 Γ 2 (-1) E [1{E high } T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } ( Ṽt (s t,1 ) -1{E span t }U t (s t,1 ))]] + 1 Γ 2 (-1) ⋅ dHB V + B P err γH + (B R noise + B R err ) ⋅ B R ϕ = 1 Γ 2 (-1) E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ E high } Ṽt (s t,1 ) -1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 )]] + 1 Γ 2 (-1) ⋅ dHB V + B P err γH + (B R noise + B R err ) ⋅ B R ϕ
This is the final bound of T 1 we need. Next, we go back to bound T 2 and T 3 .
Bounding T 2 . We upper bound the value function inside the expectation by H and obtain
T 2 ≤ H ⋅ E [1{E high } T ∑ t=1 E Ṽt [1{( Ẽhigh ) ∁ } | Ẽoptm t ]] ≤ H ⋅ E [ T ∑ t=1 E Ṽt [1{( Ẽhigh ) ∁ } | Ẽoptm t ]] (dropping E high ) = H ⋅ E [ T ∑ t=1 Pr (( Ẽhigh ) ∁ ∩ Ẽoptm t ) / Pr ( Ẽoptm t )] ≤ HT Γ 2 (-1) ⋅ Pr ((E high ) ∁ ) ≤ 1 Γ 2 (-1)
.
(Lemma 8)
Bounding T 3 . Similarly, we upper bound the value function inside the expectation by H and obtain
T 3 ≤ H ⋅ E [1{E high } T ∑ t=1 E Ṽt [1{( Ẽspan t ) ∁ } | Ẽoptm t ]] ≤ H ⋅ E [ T ∑ t=1 E Ṽt [1{( Ẽspan t ) ∁ } | Ẽoptm t ]] (dropping E high ) = H ⋅ E [ T ∑ t=1 E Ṽt [1{( Ẽspan t ) ∁ ∩ Ẽoptm t }] / Pr( Ẽoptm t )] ≤ H ⋅ E [ T ∑ t=1 E Ṽt [1{( Ẽspan t ) ∁ }] / Pr( Ẽoptm t )] ≤ H Γ 2 (-1) ⋅ E [ T ∑ t=1 1{(E span t ) ∁ }] (tower rule) ≤ dH 2 Γ 2 (-1) (Lemma 15)
Plugging all these back, we conclude the proof.
The following lemma refines the event conditions established in Lemma 17 to make the whole thing more manageable.
Lemma 18 (Refining event conditions). It holds that
E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ E high } Ṽt (s t,1 ) -1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 )]] ≤ E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } Ṽt (s t,1 ) -1{E span t ∩ E high }U t (s t,1 )]] + dHB V + 2B V /H.
Proof of Lemma 18. We start with refining the event conditions on the first term. We remove unneeded events by splitting the first term into two parts:
E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ E high } Ṽt (s t,1 ) -1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 )]] = E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } Ṽt (s t,1 ) -1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 )]] -E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ (E high ) ∁ } Ṽt (s t,1 )]]
Here, using Lemma 13, the last term can be bounded by
-E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ (E high ) ∁ } Ṽt (s t,1 )]] ≤ E [ T ∑ t=1 1{(E high ) ∁ }B V ] ≤ B V /H
where we used Lemma 8 in the last inequality.
Now we seek to remove unneeded event conditions on U t as well. We notice the following decomposition
1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 ) ≥ 1{E span t ∩ E high }U t (s t,1 ) -1{E span t ∩ E high ∩ ( Ẽhigh ) ∁ }U t (s t,1 ) -1{E span t ∩ E high ∩ ( Ẽspan t ) ∁ }U t (s t,1 ).
Plugging this back, we obtain
E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t ∩ E high } Ṽt (s t,1 ) -1{E span t ∩ E high ∩ Ẽhigh ∩ Ẽspan t }U t (s t,1 )]] ≤ E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } Ṽt (s t,1 ) -1{E span t ∩ E high }U t (s t,1 )]] + E [ T ∑ t=1 E Ṽt [1{E span t ∩ E high ∩ ( Ẽhigh ) ∁ }U t (s t,1 )]] + E [ T ∑ t=1 E Ṽt [1{E span t ∩ E high ∩ ( Ẽspan t ) ∁ }U t (s t,1 )]] + B V /H
The first term is exactly what we want. Now we bound the middle two terms separately below:
E [ T ∑ t=1 E Ṽt [1{E span t ∩ E high ∩ ( Ẽhigh ) ∁ }U t (s t,1 )]] ≤ E [ T ∑ t=1 E Ṽt [1{E span t ∩ E high ∩ ( Ẽhigh ) ∁ }B V ]] (Lemma 13) ≤ T ⋅ Pr(( Ẽhigh ) ∁ )B V ≤ B V /H (Lemma 8)
and for the other term we also have
E [ T ∑ t=1 E Ṽt [1{E span t ∩ E high ∩ ( Ẽspan t ) ∁ }U t (s t,1 )]] ≤ E [ T ∑ t=1 E Ṽt [1{( Ẽspan t ) ∁ }]B V ] (Lemma 13) = B V E [ T ∑ t=1 1{(E span t ) ∁ }] (tower rule) ≤ dHB V . (Lemma 15)
Hence, putting all together, we complete the proof.
The following lemma provides a final bound for the first term in Lemma 18.
Lemma 19 (Final bound). It holds that
E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } Ṽt (s t,1 ) -1{E span t ∩ E high }U t (s t,1 )]] ≤ 2HB P err B P ϕ + 2(B R err + B R noise ) ⋅ HB R ϕ .
Proof of Lemma 19. By tower rule, we have
E [ T ∑ t=1 E Ṽt [1{ Ẽhigh ∩ Ẽspan t } Ṽt (s t,1 ) -1{E span t ∩ E high }U t (s t,1 )]] = E [ T ∑ t=1 1{E high ∩ E span t }V t (s t,1 ) -1{E span t ∩ E high }U t (s t,1 )]
We plug in the result in Lemma 11 and get
≤ E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ⟨T (θ t,h+1 + ω t,h+1 ) -θt,h , ϕ(s t,h , a t,h )⟩] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ⟨ω t,h -ω t,h , ϕ(s t,h , a t,h )⟩]
Applying Cauchy-Schwartz inequality to each term yields
≤ E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ∥ θt,h -T (θ t,h+1 + ω t,h+1 )∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h ] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ∥T (θ t,h+1 + ω t,h+1 ) -θt,h ∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h ] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 (∥ω t,h -ω ⋆ h ∥ Σ t,h + ∥ω ⋆ h -ω t,h ∥ Σ t,h )∥ϕ(s t,h , a t,h )∥ Σ -1 t,h ]
The first two terms can be bounded by HB P err B P ϕ using Lemmas 7 and 16. For the last term, conditioning on E high , we have
∥ω t,h -ω ⋆ h ∥ Σ t,h ≤ ∥ω t,h -ωt,h ∥ Σ t,h + ∥ω t,h -ω ⋆ h ∥ Σ t,h ≤ B R err + B R noise
and similarly for ∥ω ⋆ hω t,h ∥ Σ t,h . Also, applying Lemma 16, we have
T ∑ t=1 H ∑ h=1 ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h ≤ HB R ϕ .
Inserting all these back, we get the upper bound of
2HB P err B P ϕ + 2(B R err + B R noise ) ⋅ HB R ϕ .
Hence, we complete the proof.
Proof of Theorem 6. We have
E [ T ∑ t=1 (V ⋆ (s t,1 ) -V πt (s t,1 ))] ≤ E [1{E high } T ∑ t=1 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 ))] + E [1{(E high ) ∁ } T ∑ t=1 (V ⋆ (s t,1 ) -V πt (s t,1 ))] + E [ T ∑ t=1 1{(E span t ) ∁ }(V ⋆ (s t,1 ) -V πt (s t,1 ))] =∶ T A + T B + T C .
For T A , by Lemmas 17 to 19 and re-arranging the results, we have
T A ≤ 1 Γ 2 (-1) ⋅ (2B V (dH + 1/H) + HB P err γ + dH 2 + 1 + (B R err + B R noise )(2H + 1)B R ϕ + 2HB P err B P ϕ ) = Õ(d 5/2 H 5/2 + d 2 H 3/2 √ T + ε 1 γ(dH 2 + d 3/2 H √ T ) + √ ε B (d 2 H 5/2 √ T + d 3/2 H 3/2 T ) + ε B γ(dH 2 √ T + d 3/2 HT ))
For T B , by Lemma 8, we have
T B ≤ HT ⋅ Pr ((E high ) ∁ ) ≤ 1.
For T C , by Lemma 15, we have
T C ≤ H ⋅ E [ T ∑ t=1 1{(E span t ) ∁ }] ≤ dH 2 .
Putting everything together, we complete the proof.

Section: D SUPPORTING LEMMAS
Lemma 20 (Gaussian concentration). (Abeille & Lazaric, 2017) Let $x \sim \mathcal{N}(0, c\Sigma^{-1})$ for $c \in \mathbb{R}_+$ and $\Sigma$ a positive definite matrix. Then, for any $\delta > 0$, we have
$$
\Pr\Big( \|x\|_{\Sigma} > \sqrt{2cd \log(2d/\delta)} \Big) \le \delta.
$$
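A Monte Carlo check of this tail bound can be sketched as follows. The dimension, the scale c, the confidence level, and the random positive definite matrix are all assumed values chosen for illustration. Samples from N(0, cΣ⁻¹) are drawn through a Cholesky factor of Σ, so that ∥x∥²_Σ equals c times a chi-square variable with d degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, delta, n = 5, 2.0, 0.1, 20000     # illustrative values, not from the paper

G = rng.normal(size=(d, d))
Sigma = G @ G.T + np.eye(d)             # a positive definite matrix
L = np.linalg.cholesky(Sigma)           # Sigma = L L^T

# If g ~ N(0, I), then x = sqrt(c) * L^{-T} g has covariance c * Sigma^{-1}.
g = rng.normal(size=(d, n))
xs = (np.sqrt(c) * np.linalg.solve(L.T, g)).T

norms = np.sqrt(np.einsum("ni,ij,nj->n", xs, Sigma, xs))   # ||x||_Sigma per sample
threshold = np.sqrt(2 * c * d * np.log(2 * d / delta))
exceed_frac = float(np.mean(norms > threshold))

print(f"empirical tail frequency = {exceed_frac:.4f}, allowed = {delta}")
```

In this setting the bound is quite loose: the empirical frequency is far below δ, which is consistent with the lemma being a union-bound style estimate.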
Lemma 21 (Elliptical potential lemma). Assume that X ⊆ {x ∶ ∥x∥ 2 ≤ 1} is compact and span(X) = R d . Let x 1 , . . . , x T ∈ X be a sequence of vectors, Σ 1 be a positive definite matrix with each eigenvalue bounded within the range of [a, b] for some a, b > 0, and Σ t+1 = Σ t + x t x ⊺ t . Then, we have
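$$
\sum_{t=1}^{T} \min\big\{ 1, x_t^\top \Sigma_t^{-1} x_t \big\} \le 2d \log\Big( \frac{b}{a} + \frac{T}{ad} \Big).
$$
Furthermore, if $\Sigma_1$ is constructed via optimal design, i.e., $\Sigma_1 = \mathbb{E}_{x \sim \rho}\, x x^\top$ where $\rho \in \Delta(X)$ is an optimal design over $X$, then we have
$$
\sum_{t=1}^{T} \min\big\{ 1, x_t^\top \Sigma_t^{-1} x_t \big\} \le 2d \log(T + 1).
$$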
Proof of Lemma 21. First we claim that
$$
\min\big\{ 1, x_t^\top \Sigma_t^{-1} x_t \big\} \le 2 x_t^\top \Sigma_{t+1}^{-1} x_t. \tag{14}
$$
To show this, we use Sherman-Morrison-Woodbury formula (Bhatia, 2013) for rank-one updates to a matrix inverse:
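$$
\begin{aligned}
x_t^\top \Sigma_{t+1}^{-1} x_t
&= x_t^\top (\Sigma_t + x_t x_t^\top)^{-1} x_t
= x_t^\top \Big( \Sigma_t^{-1} - \frac{\Sigma_t^{-1} x_t x_t^\top \Sigma_t^{-1}}{1 + \|x_t\|^2_{\Sigma_t^{-1}}} \Big) x_t \\
&= \|x_t\|^2_{\Sigma_t^{-1}} - \frac{\|x_t\|^4_{\Sigma_t^{-1}}}{1 + \|x_t\|^2_{\Sigma_t^{-1}}}
= \frac{\|x_t\|^2_{\Sigma_t^{-1}}}{1 + \|x_t\|^2_{\Sigma_t^{-1}}}.
\end{aligned}
$$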
Now let us consider two cases for the right-hand side of the above:
Case 1: $x_t^\top \Sigma_t^{-1} x_t \le 1$. Then, we can lower bound the right-hand side above by $\|x_t\|^2_{\Sigma_t^{-1}} / 2$.
Case 2: $x_t^\top \Sigma_t^{-1} x_t \ge 1$. Then the right-hand side above is at least $1/2$ since the function $x/(1 + x)$ is increasing in $x$.
Hence, in both cases, we have
$$
x_t^\top \Sigma_{t+1}^{-1} x_t \ge \min\big\{ 1, x_t^\top \Sigma_t^{-1} x_t \big\} / 2,
$$
which finishes the proof of (14). Since the log-determinant function is concave, we obtain $\log\det \Sigma_t - \log\det \Sigma_{t+1} \le \mathrm{tr}\big( \Sigma_{t+1}^{-1} (\Sigma_t - \Sigma_{t+1}) \big)$ via a first-order Taylor expansion. This gives us the following
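$$
\sum_{t=1}^{T} x_t^\top \Sigma_{t+1}^{-1} x_t = \sum_{t=1}^{T} \mathrm{tr}\big( \Sigma_{t+1}^{-1} (\Sigma_{t+1} - \Sigma_t) \big) \le \sum_{t=1}^{T} \big( \log\det \Sigma_{t+1} - \log\det \Sigma_t \big) = \log\Big( \frac{\det \Sigma_{T+1}}{\det \Sigma_1} \Big),
$$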
where the last step follows from telescoping. Since each eigenvalue of Σ 1 is lower bounded by a, we have det Σ 1 ≥ a d . Towards an upper bound of det Σ T +1 = det(Σ 1 + ∑ T t=1 x t x ⊺ t ), let (λ 1 , . . . , λ d ) denote the eigenvalues of ∑ T t=1 x t x ⊺ t , and then we have
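$$
\det\Big( \Sigma_1 + \sum_{t=1}^{T} x_t x_t^\top \Big) \le \prod_{i=1}^{d} (b + \lambda_i) \le \Big( \frac{1}{d} \sum_{i=1}^{d} (b + \lambda_i) \Big)^{d} \le \Big( b + \frac{1}{d}\, \mathrm{tr}\Big( \sum_{t=1}^{T} x_t x_t^\top \Big) \Big)^{d} \le \Big( b + \frac{T}{d} \Big)^{d}.
$$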
Here, the first step is Weyl's inequality, the second step is AM-GM inequality, and the last step is because the trace is bounded by T . Plugging this upper bound back, we have
$$
\log\Big( \frac{\det \Sigma_{T+1}}{\det \Sigma_1} \Big) \le d \log\Big( \frac{b}{a} + \frac{T}{ad} \Big).
$$
This completes the proof of the first statement.
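The first statement can be checked numerically on a synthetic sequence. The sketch below uses assumed values for d, T, a, and b and random unit-norm vectors (none of which come from the paper); it verifies the Sherman-Morrison identity and claim (14) at every step, and then compares the accumulated potential with the bound 2d log(b/a + T/(ad)).

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000
a, b = 0.5, 2.0                       # assumed eigenvalue range of Sigma_1

# Sigma_1 with eigenvalues in [a, b].
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = Q @ np.diag(rng.uniform(a, b, size=d)) @ Q.T

potential = 0.0                       # sum_t min{1, x_t^T Sigma_t^{-1} x_t}
claim_14_holds = True
identity_holds = True
for _ in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))              # enforce ||x||_2 <= 1
    q = x @ np.linalg.solve(Sigma, x)             # x^T Sigma_t^{-1} x
    Sigma_next = Sigma + np.outer(x, x)
    q_next = x @ np.linalg.solve(Sigma_next, x)   # x^T Sigma_{t+1}^{-1} x
    # Sherman-Morrison: q_next = q / (1 + q), which implies claim (14).
    identity_holds = identity_holds and abs(q_next - q / (1.0 + q)) < 1e-9
    claim_14_holds = claim_14_holds and min(1.0, q) <= 2.0 * q_next + 1e-12
    potential += min(1.0, q)
    Sigma = Sigma_next

bound = 2 * d * np.log(b / a + T / (a * d))
print(f"potential = {potential:.2f}, bound = {bound:.2f}")
```

The potential grows only logarithmically in T, which is what turns the Cauchy-Schwarz step in Lemma 16 into a bound of order the square root of T.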
For the case where Σ 1 is constructed via optimal design, we can rewrite Σ T +1 in the following way:
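$$
\Sigma_{T+1} = \mathbb{E}_{x \sim \rho}\, x x^\top + \sum_{t=1}^{T} x_t x_t^\top = (T + 1) \underbrace{\Big( \frac{1}{1 + T} \cdot \mathbb{E}_{x \sim \rho}\, x x^\top + \sum_{t=1}^{T} \frac{1}{1 + T} \cdot x_t x_t^\top \Big)}_{(*)} =: (T + 1)\, \mathbb{E}_{x \sim \rho'}\, x x^\top,
$$
where we consider $(*)$ as an expectation of $x x^\top$ over a new distribution that we denote by $\rho'$. Recall that $\Sigma_1$ is constructed via optimal design, which implies $\det \Sigma_1 \ge \det \mathbb{E}_{x \sim \rho'}\, x x^\top$ (Lemma 23). This gives us
$$
\log\Big( \frac{\det \Sigma_{T+1}}{\det \Sigma_1} \Big) = \log\Big( \frac{(T + 1)^d \det \mathbb{E}_{x \sim \rho'}\, x x^\top}{\det \Sigma_1} \Big) \le \log\big( (T + 1)^d \big) = d \log(T + 1).
$$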
This completes the proof. 
′ ∈ C such that ∑ n i=1 |h(z i ) -h ′ (z i )|/n ≤ ε. We define N (ε, H, n) = max Z n ∈Z n N (ε, H, Z n ).
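Below we define the pseudo-dimension (Haussler, 2018; Modi et al., 2024).
Definition 7 (VC-dimension). For a hypothesis class $\mathcal{H} \subseteq (\mathcal{X} \to \{0, 1\})$, we define its VC-dimension $\mathrm{VC\text{-}dim}(\mathcal{H})$ as the maximal cardinality of a set $X = \{x_1, \dots, x_{|X|}\} \subseteq \mathcal{X}$ that satisfies $|\mathcal{H}_X| = 2^{|X|}$ (i.e., $X$ is shattered by $\mathcal{H}$), where $\mathcal{H}_X$ is the restriction of $\mathcal{H}$ to $X$, i.e., $\{(h(x_1), \dots, h(x_{|X|})) : h \in \mathcal{H}\}$.
Definition 8 (Pseudo-dimension). For a hypothesis class $\mathcal{H} \subseteq (\mathcal{X} \to \mathbb{R})$, we define its pseudo-dimension $\mathrm{Pdim}(\mathcal{H})$ as $\mathrm{Pdim}(\mathcal{H}) = \mathrm{VC\text{-}dim}(\mathcal{H}^{+})$, where
$$
\mathcal{H}^{+} = \{ (x, \xi) \mapsto \mathbb{1}[h(x) > \xi] : h \in \mathcal{H} \} \subseteq (\mathcal{X} \times \mathbb{R} \to \{0, 1\}).
$$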
The following lemma provides a bound on the covering number of a hypothesis class via pseudo dimension. 
N (ε, H, n) ≤ (4e 2 (b -a)/ε) d .
Note that the right-hand side is independent of n.

Section: E LINEAR MDPS AND LQRS IMPLY LINEAR BELLMAN COMPLETENESS
It is already well known that linear Bellman completeness captures linear MDPs, as demonstrated in works such as Agarwal et al. (2019); Zanette et al. (2020b). Here, we show that it also captures LQRs for a convex subset of linear functions (specifically, when the value function is parameterized by a PSD matrix). We start with the definition.
Definition 9 (Linear Quadratic Regulator). A linear quadratic regulator (LQR) problem is defined by a tuple $(A, B, Q, R)$ where $A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times m}$, $Q \in \mathbb{R}^{d \times d}$, and $R \in \mathbb{R}^{m \times m}$. The objective is to find a policy $\pi$ that minimizes the following:
$$
J(\pi) = \mathbb{E}\Big[ \sum_{h=1}^{H} \big( x_h^\top Q x_h + u_h^\top R u_h \big) \Big],
$$
where $x_{h+1} = A x_h + B u_h + w_h$ and $w_h \sim \mathcal{N}(0, \Sigma)$.
Let us focus on an arbitrary step h and simply write the transition as the following (ignoring the subscript h for notational simplicity):
x ′ = Ax + Bu + w, where w ∼ N (0, Σ).
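We consider state-action value functions of the form
$$
Q(x, u) = \begin{bmatrix} x \\ u \end{bmatrix}^\top \begin{bmatrix} P_{xx} & P_{xu} \\ P_{ux} & P_{uu} \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix},
$$
where the matrix $P$ satisfies: (1) $P_{uu}$ is PSD, and (2) its Schur complement $P_{xx} - P_{xu} P_{uu}^{-1} P_{ux}$ is PSD. We note that the set of feasible $P$ is convex.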
Then, we consider the Bellman backup of $Q$ (ignoring the per-step reward (cost) for now):
$$
Q(x, u) = \mathbb{E}_{x'}\big[ \min_{u'} Q(x', u') \big]. \tag{17}
$$
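Using the first-order condition, we know that the optimal $u'$ (for a fixed $w$) satisfies $u' = -P_{uu}^{-1} P_{ux} (Ax + Bu + w)$, which implies that
$$
\min_{u'} Q(x', u') = [Ax + Bu + w]^\top \big( P_{xx} - P_{xu} P_{uu}^{-1} P_{ux} \big) [Ax + Bu + w].
$$
Taking the expectation over $w$, the term in (17) is equal to
$$
\begin{bmatrix} x \\ u \end{bmatrix}^\top \begin{bmatrix} A & B \end{bmatrix}^\top \big( P_{xx} - P_{xu} P_{uu}^{-1} P_{ux} \big) \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix} + c',
$$
where $c'$ is some constant. The middle matrix above is PSD if $P_{xx} \succeq P_{xu} P_{uu}^{-1} P_{xu}^\top$, which holds since $P$ is PSD. Thus, we conclude that the backed-up $Q$ is also linear for some PSD matrix.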
We can also easily verify that the reward (cost) function is linear in the quadratic feature. Hence, we complete the proof.
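The two facts used in this argument can be checked numerically on a small example. The sketch below assumes a randomly generated positive definite matrix P and small dimensions, all chosen purely for illustration. It verifies (a) that the quadratic form is a linear function of the quadratic feature vec(zz⊤) with z = (x, u), and (b) that minimizing over u gives the Schur complement form at the first-order solution.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, du = 3, 2
# Hypothetical positive definite P, partitioned into blocks P_xx, P_xu, P_ux, P_uu.
G = rng.normal(size=(dx + du, dx + du))
P = G @ G.T + np.eye(dx + du)
P_xx, P_xu = P[:dx, :dx], P[:dx, dx:]
P_ux, P_uu = P[dx:, :dx], P[dx:, dx:]

x = rng.normal(size=dx)
u = rng.normal(size=du)
z = np.concatenate([x, u])

# (a) Q(x, u) = z^T P z is linear in the quadratic feature vec(z z^T).
q_quadratic = z @ P @ z
q_linear = P.ravel() @ np.outer(z, z).ravel()

# (b) First-order condition: u* = -P_uu^{-1} P_ux x, and the minimum equals
#     x^T (P_xx - P_xu P_uu^{-1} P_ux) x.
u_star = -np.linalg.solve(P_uu, P_ux @ x)
schur = P_xx - P_xu @ np.linalg.solve(P_uu, P_ux)
z_star = np.concatenate([x, u_star])
min_value = z_star @ P @ z_star
schur_value = x @ schur @ x

# Any other action can only increase the quadratic form.
other_values = [np.concatenate([x, rng.normal(size=du)]) @ P
                @ np.concatenate([x, np.zeros(du)]) * 0.0 for _ in range(0)]
perturbed = [float(np.concatenate([x, u_star + rng.normal(size=du)]) @ P
                   @ np.concatenate([x, u_star]) * 0.0) for _ in range(0)]
candidates = [np.concatenate([x, rng.normal(size=du)]) for _ in range(100)]
min_over_candidates = min(float(zc @ P @ zc) for zc in candidates)
print(q_quadratic, q_linear, min_value, schur_value, min_over_candidates)
```

The feature map z ↦ vec(zz⊤) is the "quadratic feature" referred to above, and the Schur complement is the matrix that parameterizes the next-step value after minimizing over the action.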

Section: F COMPUTATIONALLY EFFICIENT IMPLEMENTATIONS FOR OPTIMIZATION ORACLES
The convex programming algorithm given in Algorithm 2 is due to Bertsimas & Vempala (2004).
We provide an informal description of Algorithm 2 below and refer the reader to Bertsimas & Vempala (2004) for the full details.
At an iteration t ≤ T , Algorithm 2 starts with a set D t which contains the set K, and a set of 2N points U t sampled (approximately) uniformly from D t using the SAMPLER subroutine in Algorithm 3. It then uses the first N samples from U t to compute an approximate centroid z t of the set D t in Line 23; the remaining points from U t are denoted by V t . It then queries the separation oracle O sep K at the point z t . If z t ∈ K, then we terminate and return z t . Otherwise, we use the separating hyperplane between z t and K to shrink the set D t further into D t+1 in Line 29. Finally, it calls SAMPLER again using the set of points V t as a warm start to get 2N new (approximately) i.i.d. samples from D t+1 in Line 30. Equipped with the sets D t+1 and U t+1 , another iteration of the algorithm follows.
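The loop described above can be sketched in a few lines. This is a simplified illustration, not the algorithm of Bertsimas & Vempala (2004): the SAMPLER subroutine is replaced by plain rejection sampling from the initial cube (only viable in low dimension), and the target set K, the function names, and all parameter values are hypothetical.

```python
import numpy as np

def centroid_cutting_plane(separation_oracle, d, width, num_samples, max_iters, rng):
    """Search for a point of a convex set K inside the cube [-width/2, width/2]^d.

    separation_oracle(z) returns None if z is in K, and otherwise a vector a
    such that <a, y> <= <a, z> for every y in K.
    """
    cuts = []                                   # half-spaces (a_t, <a_t, z_t>) defining D_t

    def in_D(points):
        ok = np.ones(len(points), dtype=bool)
        for a, b in cuts:
            ok &= (points @ a <= b)
        return ok

    for _ in range(max_iters):
        # Stand-in for SAMPLER: rejection sampling from the cube.
        accepted, count = [], 0
        while count < num_samples:
            cand = rng.uniform(-width / 2, width / 2, size=(4 * num_samples, d))
            kept = cand[in_D(cand)]
            accepted.append(kept)
            count += len(kept)
        z = np.vstack(accepted)[:num_samples].mean(axis=0)   # approximate centroid of D_t
        a = separation_oracle(z)
        if a is None:
            return z                            # z_t lies in K
        cuts.append((a, float(a @ z)))          # D_{t+1} = D_t intersected with {y : <a, y> <= <a, z_t>}
    return None                                 # no point found within the iteration budget

# Hypothetical target set: the Euclidean ball K = {y : ||y - c|| <= rho}.
c, rho = np.array([0.6, -0.4]), 0.2

def ball_oracle(z):
    diff = z - c
    dist = np.linalg.norm(diff)
    return None if dist <= rho else diff / dist

rng = np.random.default_rng(0)
z_found = centroid_cutting_plane(ball_oracle, d=2, width=2.0,
                                 num_samples=200, max_iters=60, rng=rng)
print(z_found)
```

Each cut passes through the approximate centroid, so a constant fraction of the volume of D t is removed per iteration; since K always remains inside D t, the loop must stop after a number of iterations logarithmic in the volume ratio.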
On receiving a convex set D and a set of points V, the SAMPLER protocol in Algorithm 3 first refines V to V ′ by discarding any points z ∈ V that do not lie in D. Then, it starts a random ball walk from the samples in V ′ : to update the current point ẑ, it first samples a point z ′ uniformly from the ellipsoid ẑ + ηΛ^{1/2} B d (1) (where Λ is defined using the points in V ′ ) and then updates ẑ ← z ′ if z ′ ∈ D. The analysis of Bertsimas & Vempala (2004) shows that this ball walk mixes fast to a uniform distribution over the set D.
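A single chain of the ball walk can be sketched as follows. This is a minimal illustration under assumptions: the convex set is given by a membership test, the shape matrix standing in for Λ^{1/2} is passed in explicitly (here the identity), and the step size, chain length, and example set are arbitrary choices rather than the settings analyzed by Bertsimas & Vempala (2004).

```python
import numpy as np

def ball_walk(in_set, start, num_steps, step_size, shape_sqrt, rng):
    """Run a ball walk inside a convex set D described by a membership test.

    in_set:     callable z -> bool, membership oracle for D
    shape_sqrt: matrix playing the role of Lambda^{1/2} (shape of the proposal ellipsoid)
    """
    d = start.shape[0]
    z = start.copy()
    for _ in range(num_steps):
        # Uniform point in the unit ball: random direction times radius U^{1/d}.
        direction = rng.normal(size=d)
        direction /= np.linalg.norm(direction)
        radius = rng.uniform() ** (1.0 / d)
        proposal = z + step_size * shape_sqrt @ (radius * direction)
        if in_set(proposal):          # move only if the proposal stays inside D
            z = proposal
    return z

def in_cube(z):
    """Hypothetical convex set D = [-1, 1]^3."""
    return bool(np.all(np.abs(z) <= 1.0))

rng = np.random.default_rng(0)
d = 3
samples = np.array([
    ball_walk(in_cube, np.zeros(d), num_steps=200, step_size=0.5,
              shape_sqrt=np.eye(d), rng=rng)
    for _ in range(300)
])
print(samples.mean(axis=0), samples.std(axis=0))
```

Because the proposal is symmetric and rejected moves leave the point unchanged, the uniform distribution on D is stationary for this chain; choosing the ellipsoid shape from the current samples is what keeps the walk efficient when D is elongated.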
Section: G MISSING DETAILS FROM SECTION 6.2

Section: G.1 COMPUTATIONALLY EFFICIENT ESTIMATION OF REWARD FUNCTION (EQN. 2)
The convex set feasibility procedure of Bertsimas & Vempala (2004) can also be used to estimate the parameters for the reward functions in Equation (2) in Algorithm 1. Note that for any time t and Algorithm 4 Computationally Efficient Implementation of O sq apx for Value Estimation Require: • Data samples {(s i , a i , u i )} i≤t .
• Convex domain O(W ).
• Approximation parameter ε.
• Linear optimization oracle O lin defined in Assumption 6.  and stopping at the smallest point ∆ for which K ∆ APX has a feasible solution. It is easy to see that for any ∆, either K ∆ APX is empty or the shifted cube ωt,h + R ∞ (ε) ⊆ K ∆ APX . Furthermore, under Assumption 7 we also have that K ∆ APX ⊆ R ∞ (R) for any ∆. Thus, for any ∆, whenever a feasible solution exists, the set K ∆ APX satisfies the prerequisites for Theorem 4, where recall that we can tolerate the parameter R to be exponential in the dimension d or the horizon H. Furthermore, a separation oracle O sep K ∆ APX can be easily implemented by using the linear optimization oracle O lin w.r.t. the feature space (Assumption 6) and by explicitly constructing a separation oracle for the ellipsoidal constraint
t-1 ∑ i=1 (⟨ω, ϕ(s i,h , a i,h )⟩ -r i,h ) 2 ≤ ∆ + ε.
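One way to construct a separation oracle for this squared-loss constraint is through the gradient of the loss, which is a standard device for sublevel sets of convex functions. The sketch below is an assumption about how such an oracle could look, not the paper's Algorithm 5; the function name, the synthetic data, and the level (standing in for ∆ + ε) are hypothetical.

```python
import numpy as np

def squared_loss_separation(omega, Phi, r, level):
    """Separation oracle for {w : sum_i (<w, phi_i> - r_i)^2 <= level}.

    Returns None if omega is feasible; otherwise a pair (g, b) such that every
    feasible w satisfies <g, w> <= b while <g, omega> > b.
    """
    residual = Phi @ omega - r
    loss = float(residual @ residual)
    if loss <= level:
        return None
    g = 2.0 * Phi.T @ residual                  # gradient of the loss at omega
    # Convexity: loss(w) >= loss(omega) + <g, w - omega>, so every feasible w obeys
    # <g, w> <= <g, omega> - (loss(omega) - level).
    return g, float(g @ omega) - (loss - level)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 4))                  # rows are feature vectors phi(s_i, a_i)
w_true = rng.normal(size=4)
r = Phi @ w_true + 0.01 * rng.normal(size=20)   # noisy rewards
level = float(np.sum((Phi @ w_true - r) ** 2)) + 0.1

omega_bad = w_true + 1.0                        # a point far outside the constraint set
cut = squared_loss_separation(omega_bad, Phi, r, level)
g, b = cut
print(g @ w_true <= b, g @ omega_bad > b)
```

The returned half-space can be handed to a cutting-plane routine in the same way as the hyperplanes obtained from the linear optimization oracle for the constraints on the feature space.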
We provide the implementation of the above in Algorithm 5, which relies on Algorithm 2 for solving the corresponding set feasibility problems. The guarantee in Theorem 4 to find a feasible point in K ∆ APX (for each ∆) gives the following guarantee on computational efficiency for Algorithm 5.
Theorem 7. Let ε > 0, δ ∈ (0, 1), and suppose Assumption 7 holds with some parameter R > 0. Additionally, suppose Assumption 6 holds with the linear optimization oracle denoted by O lin . Then, for any t ∈ [T ] and h ∈ [H], Algorithm 5 returns a point ωt,h that, with probability at least 1 − δ, satisfies
// Define a Set Feasibility Problem using ∆ //
3: Define the convex set
$$
K^{\Delta}_{\mathrm{APX}} := \left\{ \omega \in \mathbb{R}^d \;\middle|\; \begin{array}{l} \sum_{i=1}^{t-1} \big( \langle \omega, \phi(s_i, a_i) \rangle - r_i \big)^2 \le \Delta + \varepsilon \\ |\langle \omega, \phi(s, a) \rangle| \le 1 + \varepsilon \text{ for all } s, a \end{array} \right\} \tag{29}
$$
4: // Define a Separation Oracle for K ∆ APX using O lin //

Section: ACKNOWLEDGMENTS
We thank Yuda Song, Zeyu Jia, Noah Golowich, and Sasha Rakhlin for useful discussions. AS acknowledges support from the Simons Foundation and NSF through award DMS-2031883, as well as from the DOE through award DE-SC0022199. WS acknowledges support from NSF IIS-2154711, NSF CAREER 2339395, and DARPA LANCER: LeArning Network CybERagents.

Section: Published as a conference paper at ICLR 2025
The following inequality is well-known; we use the version stated in Zhu & Nowak (2022).
Lemma 22 (Freedman's inequality). Let {X t } t≤T be a real-valued martingale difference sequence adapted to the filtration F t , and let
almost surely, then for any η ∈ (0, 1/B), the following holds with probability at least 1 − δ:
Lemma 23. (Lattimore & Szepesvári, 2020) Assume that Φ ⊆ R d is compact and span(Φ) = R d . For a distribution ρ over Φ, define Λ(ρ) = ∑ ϕ∈Φ ρ(ϕ)ϕϕ ⊺ and g(ρ) = max ϕ∈Φ ∥ϕ∥ 2 Λ(ρ) -1 . Then, the following are equivalent:
Furthermore, there exists a minimizer ρ of g such that |supp(ρ)| ≤ d(d + 1)/2.
Below we show that the Cauchy-Schwarz inequality is still valid when the matrix is not invertible under some conditions. We start with the following lemma.
Lemma 24. Let A be a positive semi-definite matrix. Let B be a square root of A, i.e., A = BB ⊺ . Then range(A) = range(B).
Proof of Lemma 24. We first show that range(A) ⊆ range(B). To see this, for any y ∈ range(A), there exists x such that y = Ax = BB ⊺ x = B(B ⊺ x). Hence y ∈ range(B). Next, we show that range(B) ⊆ range(A). To see this, for any y ∈ range(B), there exists x such that y = Bx. Let x = x 0 + x 1 where x 0 ∈ null(B) and x 1 ∈ rowspace(B). Then, y = Bx = Bx 1 . Since x 1 ∈ rowspace(B), there exists z such that x 1 = B ⊺ z. Thus, y = Bx 1 = BB ⊺ z = Az. Hence, y ∈ range(A).
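The statement of Lemma 24 is easy to test numerically by comparing ranks. The sketch below builds a hypothetical rank-deficient square matrix B, sets A = BB⊤, and checks that stacking the columns of A and B side by side does not increase the rank, which is equivalent to the two ranges being equal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 3
# Hypothetical rank-r square matrix B, so that A = B B^T is PSD and singular.
B = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
A = B @ B.T

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
# range(A) = range(B) exactly when the joint column space has the same dimension.
rank_joint = np.linalg.matrix_rank(np.hstack([A, B]))

print(rank_A, rank_B, rank_joint)
```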
Lemma 25 (Cauchy-Schwarz under pseudo-inverse). Let Σ be a positive semi-definite matrix (not necessarily invertible). Then, for any x ∈ range(Σ) and any y ∈ R d , we have
$$
\langle x, y \rangle \le \|x\|_{\Sigma^{\dagger}} \|y\|_{\Sigma}.
$$
Proof of Lemma 25. Let B denote the square root of Σ and force B to be positive semi-definite. One can verify that BB † is the orthogonal projection matrix onto range(B), and hence, range(Σ) (recalling that range(B) = range(Σ) by Lemma 24). Therefore, for any x ∈ range(Σ), we have BB † x = x. Then, we have
where the inequality follows from the standard Cauchy-Schwarz inequality.
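The role of the range condition can be seen in a small experiment. The sketch below assumes the inequality has the form ⟨x, y⟩ ≤ ∥x∥_{Σ†} ∥y∥_Σ for x in range(Σ), which is how Lemma 25 is applied earlier in the appendix; the singular matrix and test vectors are synthetic. It checks the inequality on random vectors in the range and shows that it fails for a vector in the null space.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 3
F = rng.normal(size=(d, r))
Sigma = F @ F.T                                   # PSD with rank r < d, hence not invertible
Sigma_pinv = np.linalg.pinv(Sigma, rcond=1e-10)   # explicit cutoff for the zero eigenvalues

def seminorm(v, M):
    """sqrt(v^T M v), clipped at zero to absorb rounding error."""
    return float(np.sqrt(max(v @ M @ v, 0.0)))

worst_gap = np.inf
for _ in range(1000):
    x = Sigma @ rng.normal(size=d)                # x lies in range(Sigma)
    y = rng.normal(size=d)                        # y is arbitrary
    gap = seminorm(x, Sigma_pinv) * seminorm(y, Sigma) - abs(x @ y)
    worst_gap = min(worst_gap, gap)

# Outside range(Sigma) the inequality can fail: take a null-space direction.
null_vec = np.linalg.svd(Sigma)[0][:, -1]
fails_outside = (null_vec @ null_vec
                 > seminorm(null_vec, Sigma_pinv) * seminorm(null_vec, Sigma))

print(worst_gap >= -1e-8, fails_outside)
```

This is why the analysis applies Lemma 25 only on rounds where the span event holds, so that the feature vector is guaranteed to lie in the range of the empirical covariance.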
Lemma 26 (Invariance under projection). Let Σ ∈ R d×d be a positive semi-definite matrix of rank r.
For any vector ϕ ∈ R d , we have ∥ϕ∥ Σ † = ∥P ϕ∥ Σ † where P is the projection onto the range of Σ.
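Proof of Lemma 26. Consider the eigen-decomposition Σ = U ΛU ⊺ , so Σ † = U Λ † U ⊺ . Without loss of generality, we assume that the non-zero diagonal entries of Λ come first and the zero entries come last. Denote by U r the matrix obtained by replacing the last d − r columns of U by 0. Note that the first r columns of U are in the range of Σ and the last d − r columns are orthogonal to it, so we must have P U = U r . Then, we have the following: ∥P ϕ∥ 2 Σ † = ϕ ⊺ P U Λ † U ⊺ P ϕ = ϕ ⊺ U r Λ † U r ⊺ ϕ = ϕ ⊺ U Λ † U ⊺ ϕ = ∥ϕ∥ 2 Σ † , where the third equality holds because the last d − r diagonal entries of Λ † are zero.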
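Lemma 26 can also be checked numerically. The sketch below is an illustration under assumed random inputs: it builds a rank-r positive semi-definite Σ, forms the orthogonal projection onto range(Σ) as ΣΣ †, and compares the two norms.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 5, 2

G = rng.standard_normal((d, r))
Sigma = G @ G.T                        # PSD matrix of rank r
Sigma_pinv = np.linalg.pinv(Sigma)
P = Sigma @ Sigma_pinv                 # orthogonal projection onto range(Sigma)

phi = rng.standard_normal(d)           # arbitrary vector, not necessarily in range(Sigma)
proj_phi = P @ phi

norm_phi = float(np.sqrt(phi @ Sigma_pinv @ phi))
norm_proj = float(np.sqrt(proj_phi @ Sigma_pinv @ proj_phi))
```

The two values agree because Σ † annihilates the component of ϕ orthogonal to range(Σ).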
Algorithm 2 Solving Convex Programs by Random Walks (Bertsimas & Vempala, 2004)
1: Let T = 2d log( R /δr) and N = O(d log( 1 /δ)).
2: Let D 1 be the axis-aligned cube with width R with center z 1 = 0.
...
8: Return z t and terminate.
9: else
10: // If z t ∉ K, shrink the set D t using a separating hyperplane //
11: Let ⟨a t , z⟩ ≤ b be the separating hyperplane returned by O sep K (z t ).
12: Let D t+1 ← D t ∩ H t where H t denotes the halfspace {z | ⟨a t , z⟩ ≤ ⟨a t , z t ⟩}.
...
15: end if
16: end for
17: Terminate and report that K is empty.
Algorithm 3 SAMPLER used in Algorithm 2
Require:
• Convex set D.
• Parameter N .
• Points V = {z 1 , . . . , z N }.
... (zz)(zz) T .
3: Let U = ∅ and ẑ ∈ V ′ be an arbitrary starting point (note that ẑ ∈ D).
In the following, we provide a computationally efficient procedure, based on Algorithm 2, to approximately solve the above squared loss minimization problem given a linear optimization oracle over the feature space (Assumption 6). Note that since r i,h ∈ [0, 1], the constraint on the point ω implies that the objective value in Equation (27) is at most 2. Thus, we can solve the above


References:
[b0] Marc Abeille; Alessandro Lazaric (2017). Linear thompson sampling revisited. PMLR
[b1] Alekh Agarwal; Nan Jiang; Sham M Kakade; Wen Sun (2019). Reinforcement learning: Theory and algorithms. 
[b2] Alekh Agarwal; Yujia Jin; Tong Zhang (2023). VOQL: Towards optimal regret in model-free rl with nonlinear function approximation. PMLR
[b3] Priyank Agrawal; Jinglin Chen; Nan Jiang (2021). Improved worst-case regret bounds for randomized least-squares value iteration. 
[b4] OpenAI; Marcin Andrychowicz; Bowen Baker; Maciek Chociej; Rafal Jozefowicz; Bob McGrew; Jakub Pachocki; Arthur Petron; Matthias Plappert; Glenn Powell; Alex Ray (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research
[b5] Mohammad Gheshlaghi Azar; Ian Osband; Rémi Munos (2017). Minimax regret bounds for reinforcement learning. PMLR
[b6] Christopher Berner; Greg Brockman; Brooke Chan; Vicki Cheung; Przemysław Debiak; Christy Dennison; David Farhi; Quirin Fischer; Shariq Hashme; Chris Hesse (2019). Dota 2 with large scale deep reinforcement learning. 
[b7] Dimitris Bertsimas; Santosh Vempala (2004). Solving convex programs by random walks. Journal of the ACM (JACM)
[b8] Rajendra Bhatia (2013). Matrix analysis. Springer Science & Business Media
[b9] Jianyu Chen; Bodi Yuan; Masayoshi Tomizuka (2019). Model-free deep reinforcement learning for urban autonomous driving. IEEE
[b10] Simon Du; Sham Kakade; Jason Lee; Shachar Lovett; Gaurav Mahajan; Wen Sun; Ruosong Wang (2021). Bilinear classes: A structural framework for provable generalization in rl. PMLR
[b11] Simon S Du; Jason D Lee; Gaurav Mahajan; Ruosong Wang (2020). Agnostic q-learning with function approximation in deterministic systems: Tight bounds on approximation error and sample complexity. 
[b12] Dylan J Foster; Sham M Kakade; Jian Qian; Alexander Rakhlin (2021). The statistical complexity of interactive decision making. 
[b13] Noah Golowich; Ankur Moitra (2024). Linear bellman completeness suffices for efficient online reinforcement learning with few actions. PMLR
[b14] David Haussler (2018). Decision theoretic generalizations of the pac model for neural net and other learning applications. CRC Press
[b15] Jiafan He; Heyang Zhao; Dongruo Zhou; Quanquan Gu (2023). Nearly minimax optimal reinforcement learning for linear markov decision processes. PMLR
[b16] Haque Ishfaq; Qiwen Cui; Viet Nguyen; Alex Ayoub; Zhuoran Yang; Zhaoran Wang; Doina Precup; Lin Yang (2021). Randomized exploration in reinforcement learning with general value function approximation. PMLR
[b17] Haque Ishfaq; Qingfeng Lan; Pan Xu; A Rupam Mahmood; Doina Precup; Anima Anandkumar; Kamyar Azizzadenesheli (2023). Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo. 
[b18] Nan Jiang; Akshay Krishnamurthy; Alekh Agarwal; John Langford; Robert E Schapire (2017). Contextual decision processes with low bellman rank are pac-learnable. PMLR
[b19] Chi Jin; Zeyuan Allen-Zhu; Sebastien Bubeck; Michael I Jordan (2018). Is q-learning provably efficient? Advances in neural information processing systems. 
[b20] Chi Jin; Zhuoran Yang; Zhaoran Wang; Michael I Jordan (2020). Provably efficient reinforcement learning with linear function approximation. PMLR
[b21] Chi Jin; Qinghua Liu; Sobhan Miryoosefi (2021). Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in neural information processing systems
[b22] Tor Lattimore; Csaba Szepesvári (2020). Bandit algorithms. Cambridge University Press
[b23] Lihong Li; Wei Chu; John Langford; Robert E Schapire (2010). A contextual-bandit approach to personalized news article recommendation. 
[b24] Aditya Modi; Jinglin Chen; Akshay Krishnamurthy; Nan Jiang; Alekh Agarwal (2024). Model-free representation learning and exploration in low-rank mdps. Journal of Machine Learning Research
[b25] Rémi Munos (1999). Error bounds for approximate value iteration. MIT Press
[b26] Ian Osband; Benjamin Van Roy; Zheng Wen (2016). Generalization and exploration via randomized value functions. PMLR
[b27] Daniel Russo; Benjamin Van Roy (2013). Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems
[b28] David Silver; Aja Huang; Chris J Maddison; Arthur Guez; Laurent Sifre; George van den Driessche; Julian Schrittwieser; Ioannis Antonoglou; Veda Panneershelvam; Marc Lanctot (2016). Mastering the game of go with deep neural networks and tree search. Nature
[b29] Yuda Song; Yifei Zhou; Ayush Sekhari; Andrew Bagnell; Akshay Krishnamurthy; Wen Sun (2022). Hybrid rl: Using both offline and online data can make rl efficient. 
[b30] Wen Sun; Nan Jiang; Akshay Krishnamurthy; Alekh Agarwal; John Langford (2019). Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. PMLR
[b31] Ruosong Wang; Russ R Salakhutdinov; Lin Yang (2020). Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems

Figures:
Figure fig_0: 
Type: figure
Caption: Lemma 27. (Corollary 42 of Modi et al. (2024)) Given a hypothesis class H ⊆ Z ↦ [a, b] with Pdim(H) ≤ d, then, for any n, we have
Data: 

Figure fig_1: 
Type: figure
Caption: // Find a feasible point in K APX // 7: Invoke Algorithm 2 to return a feasible point in the set K APX with O sep KAPX as the separation oracle.
optimization problem up to precision ε, by iterating over the set ∆ ∈ {0, ε, 2ε, . . . , 2 − ε, 2} in order to solve the set feasibility problem: ∑ t-1 i=1 (⟨ω, ϕ(s i,h , a i,h )⟩ − r i,h ) 2 ≤ ∆ + ε, |⟨ω, ϕ(s, a)⟩| ≤ 1 + ε for all s, a.
Data: 

Figure tab_1: 1
Type: table
Caption: Notation used in the paper. We list the notation used in this paper in Table 1, for convenience of reference.
Data: Yuanhao Wang, Ruosong Wang, and Sham Kakade. An exponential lower bound for linearly realizable mdp with constant suboptimality gap. Advances in Neural Information Processing Systems, 34:9521-9533, 2021.
Gellért Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in mdps with linearly-realizable optimal action-value functions. In Algorithmic Learning Theory, pp. 1237-1264. PMLR, 2021.
Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762-782, 2017.
Runzhe Wu and Wen Sun. Making rl with preference-based feedback efficient via randomization. In The Twelfth International Conference on Learning Representations, 2024.
Tengyang Xie, Dylan J Foster, Yu Bai, Nan Jiang, and Sham M Kakade. The role of coverage in online reinforcement learning. arXiv preprint arXiv:2210.04157, 2022.
Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics, pp. 1954-1964. PMLR, 2020a.
Yinglun Zhu and Robert Nowak. Efficient active learning with abstention. arXiv preprint arXiv:2204.00043, 2022.

Figure tab_2: 
Type: table
Caption: Upper bound of ∥ θt,h -T (ω t,h + θ t,h+1 )∥ Σt,h , defined in Lemma 7
Data: optm tOptimism event at round t, defined in Lemma 14U t,hValue function lower bound, defined in Appendix C.2B RB R noiseUpper bound of ∥ξ R t,h ∥ Σ t,h , defined in Definition 5B P noise,hUpper bound of ∥ξ P t,h ∥ Λ t,h , defined in Definition 5B R ϕUpper bound of ∑T t=1 ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h defined in Lemma 16B P ϕUpper bound of ∑T t=1 1{E span tt,h }∥ϕ(s t,h , a t,h )∥ Σ †, defined in Lemma 16B

Figure tab_3: 
Type: table
Caption: Since O(1 + ε 2 ) is a linear function class, which has pseudo-dimension d (Definition 8), we have |C| ≤ (8(1 + ε 2 )e 2 /α)
Data: d(8)by Lemma 27. Now define z

Figure tab_9: 
Type: table
Caption: D.1 PSEUDO DIMENSION AND COVERING NUMBER Definition 6 (ℓ 1 -Covering number). (Definition 4 of Modi et al. (2024)) Given a hypothesis class H ⊆ (Z ↦ R) and Z n = (z 1 , . . . , z n ) ∈ Z n , ε > 0, define N (ε, H, Z n ) as the minimum cardinality of a set C ⊆ H, such that for any h ∈ H, there exists h
Data: 

Figure tab_10: 
Type: table
Caption: ]It is linear in the quadratic feature ϕ(x, u) = [x 2 , u 2 , xu, x, u, 1]. Without loss of generality, we assume P xu = P ⊺ ux . Note that we may not have the Bellman completeness for any such Q. However, it does hold under the restriction that P = [ P xx P xu P ux P uu ] is PSD. Recall that P is PSD if and only if
Data: ⊺[ P ux P uu P xx P xu] [x u ] + c.(16)

Figure tab_11: 
Type: table
Caption: = E w [min u ′ [ = E w [min u ′ {[Ax + Bu + w] T P xx [Ax + Bu + w] + 2 [Ax + Bu + w] T P xu u ′ + u ′T P uu u ′ }] + c.
Data: Ax + Bu + w u ′] ⊺[ P ux P uu P xx P xu] [Ax + Bu + w u ′] + c](18)

Figure tab_12: 
Type: table
Caption: T [P xx -P xu P -1 uu P ux ] [Ax + Bu + w] + c (21) Plugging the above in (19), we getQ(x, u) = E w [[Ax + Bu + w] T [P xx -P xu P -1 uu P ux ] [Ax + Bu + w] + c] (22) = [Ax + Bu] T [P xx -P xu P -1 uu P ux ] [Ax + Bu] + c + Tr((P xx -P xu P -1 uu P ux )Σ) (23) = [Ax + Bu] T [P xx -P xu P -1 uu P ⊺ xu ] [Ax + Bu] + c ′ B T ] (P xx -P xu P -1 uu P ⊺ xu ) [A B] [
Data: (24)= [ u x]x u ] + c ′(25)

Figure tab_13: 
Type: table
Caption: 1: // Convert Square Loss Minimization into a Set Feasibility Problem //
2: Define the convex set K APX ∶= {θ ∈ R d ∶ ⟨θ, ϕ(s i , a i )⟩ − u i ≤ ε for all i ≤ t; ⟨θ, ϕ(s i , a i )⟩ − u i ≥ −ε for all i ≤ t; |⟨θ, ϕ(s, a)⟩| ≤ W h + ε for all s, a}. (26)
3: // Define a Separation Oracle for the set K APX using O lin //
4: Definition O sep KAPX (Input: parameter θ ∈ R d )
• For all i ≤ t, verify if −ε ≤ ⟨θ, ϕ(s i , a i )⟩ − u i ≤ ε.
▸ Output any violating constraint as a separating hyperplane. Terminate.
• Then, verify if max{max s,a ⟨θ, ϕ(s, a)⟩, max s,a ⟨−θ, ϕ(s, a)⟩} ≤ W h + ε using the linear optimization oracle O lin (Assumption 6).
▸ If violated, use O lin to compute a violating constraint and return it as the separating hyperplane. Terminate.
▸ Otherwise, return that the point θ ∈ K APX . Terminate.
Data: 
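The separation oracle for K APX described above can be sketched in a few lines. This is a minimal illustration, assuming a finite feature set so that the linear optimization oracle reduces to an argmax over rows; the function names `lin_oracle` and `separation_oracle` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def lin_oracle(features, theta):
    """Hypothetical stand-in for O_lin: maximize <theta, phi> over a finite feature set."""
    return features[int(np.argmax(features @ theta))]

def separation_oracle(theta, data_phi, targets, features, W, eps):
    """Return None if theta lies in K_APX; otherwise return (a, b) such that
    <a, z> <= b holds for every z in K_APX while <a, theta> > b."""
    residuals = data_phi @ theta - targets
    for phi, res, u in zip(data_phi, residuals, targets):
        if res > eps:              # violates <theta, phi_i> - u_i <= eps
            return phi, u + eps
        if res < -eps:             # violates <theta, phi_i> - u_i >= -eps
            return -phi, eps - u
    for sign in (1.0, -1.0):       # check max <theta, phi> and max <-theta, phi>
        phi = lin_oracle(features, sign * theta)
        if sign * float(phi @ theta) > W + eps:
            return sign * phi, W + eps
    return None

features = np.eye(2)                    # toy feature set
data_phi = np.array([[1.0, 0.0]])       # one observed feature
targets = np.array([0.5])               # its regression target u_1
W, eps = 1.0, 0.1

theta_ok = np.array([0.5, 0.2])
theta_fit = np.array([2.0, 0.0])        # breaks the data-fit constraint
theta_box = np.array([0.5, 3.0])        # breaks the norm constraint

res_ok = separation_oracle(theta_ok, data_phi, targets, features, W, eps)
a_fit, b_fit = separation_oracle(theta_fit, data_phi, targets, features, W, eps)
a_box, b_box = separation_oracle(theta_box, data_phi, targets, features, W, eps)
```

Each returned pair (a, b) describes a halfspace that contains K APX but excludes the queried point, which is the input a cutting-plane method such as Algorithm 2 needs.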

Figure tab_14: 
Type: table
Caption: Algorithm 5 Computationally Efficient Implementation of O sq apx for Reward Estimation Require: • Data samples {(s i , a i , r i )} i≤t .• Convex domain O(1).• Approximation parameter ε.• Linear optimization oracle O lin defined in Assumption 6.
Data: t-1t-1∑ω∈O(1)∑

Figure tab_15: 
Type: table
Caption: (⟨ω, ϕ(s i , a i )⟩r i ) 2 ≤ ∆ + ε w.r.t. ω. Terminate. • Then, verify if max{max s,a ⟨ω, ϕ(s, a)⟩, max s,a ⟨-ω, ϕ(s, a)⟩} ≤ 1 + ε using the linear optimization oracle O lin (Assumption 6). ▸ If violated, use O lin to compute a violating constraint and return it as the separating hyperplane. Terminate. ▸ Otherwise, return that the point ω ∈ K ∆ APX . Terminate. If succeeded in finding a feasible point ω ∈ K ∆ APX . Return ω and terminate. • Else, continue.
Data: 5:Definition O sep K ∆▸If not, ∑ t-1return a separating hyperplane for the ellipsoid6:EndDefinition7: 8:// Find a feasible point in K ∆ APX // Invoke Algorithm 2 with O sep APX K ∆ as the separation oracle.•


Formulas:
Formula formula_0: $Q^{\pi}_h(s, a) = \mathbb{E}^{\pi}\left[\sum_{i=h}^{H} r_i \mid s_h = s,\ a_h = a\right]$

Formula formula_1: V ⋆ h (s) = max π V π h (s)

Formula formula_2: ⟨T θ, ϕ(s, a)⟩ = E s ′ ∼T(s,a) max a ′ ⟨θ, ϕ(s ′ , a ′ )⟩.

Formula formula_3: [0, 1] with mean r h (s, a) = ⟨ω ⋆ h , ϕ(s, a)⟩ for some unknown ω ⋆ h ∈ R d .

Formula formula_4: $\mathrm{Reg}_T := \mathbb{E}\left[\sum_{t=1}^{T}\left(V^{\star}(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right].$

Formula formula_5: 1 ) = ( √ ε, √ p -ε), µ(s 2 ) = (p/ √ ε, 0), and µ(s 3 ) = (0, (1 -p)/ √ p -ε).

Formula formula_6: • Noise variances {σ h } H h=1 and σ R . • A D-optimal design for Φ = {ϕ(s, a) ∶ s ∈ S, a ∈ A} given by {(ϕ i , ρ i )} m i=1 . • Squared loss minimization oracle O sq . 1: Define Σ 1,h ∶= ∑ m i=1 ρ i ϕ i ϕ ⊺ i for all h ∈ [H]. 2: for t = 1, . . . , T do 3:

Formula formula_7: Let Λ t,h ← ∑ m i=1 ρ i (ϕ ∥ t,h,i (ϕ ∥ t,h,i ) ⊺ + ϕ ⊥ t,h,i (ϕ ⊥ t,h,i ) ⊺ ) 8:

Formula formula_8: θt,h ← argmin θ∈O(W h ) t-1 ∑ i=1 (⟨θ, ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 (1) ωt,h ← argmin ω∈O(1) t-1 ∑ i=1 (⟨ω, ϕ(s i,h , a i,h )⟩ -r i,h ) 2(2) 10:

Formula formula_9: θ t,h ∼ θt,h + N (0, σ 2 h (I -P t,h )Λ -1 t,h (I -P t,h )) ω t,h ∼ ωt,h + N (0, σ 2 R Σ -1 t,h ) 12: Define Q t,
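The null-space perturbation of θ in the sampling step above can be sketched as follows. This is an illustration under stated assumptions: P t,h is taken to be the orthogonal projection onto the span of the observed features, ϕ ∥ and ϕ ⊥ are taken to be P ϕ and (I − P)ϕ, and the function name `sample_null_space_noise` and the toy inputs are hypothetical. The noise is drawn as (I − P)ζ with ζ ∼ N(0, σ²Λ⁻¹), which has covariance σ²(I − P)Λ⁻¹(I − P).

```python
import numpy as np

def sample_null_space_noise(data_phi, design_phi, design_rho, sigma, rng):
    """Draw xi = (I - P) zeta with zeta ~ N(0, sigma^2 * Lam^{-1})."""
    d = design_phi.shape[1]
    identity = np.eye(d)
    # Assumption: P projects onto the span of the observed feature rows.
    P = np.linalg.pinv(data_phi) @ data_phi
    par = design_phi @ P                  # rows are P phi_i (P is symmetric)
    perp = design_phi @ (identity - P)    # rows are (I - P) phi_i
    w = design_rho[:, None]
    Lam = (par * w).T @ par + (perp * w).T @ perp
    cov = sigma ** 2 * np.linalg.inv(Lam)
    cov = (cov + cov.T) / 2               # symmetrize against round-off
    zeta = rng.multivariate_normal(np.zeros(d), cov)
    return (identity - P) @ zeta

rng = np.random.default_rng(0)
data_phi = np.array([[1.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0]])    # observed features span the first two axes
design_phi = np.eye(3)                    # toy design points
design_rho = np.full(3, 1.0 / 3.0)        # toy design weights
xi = sample_null_space_noise(data_phi, design_phi, design_rho, 1.0, rng)
fitted_change = data_phi @ xi             # effect of the noise on observed features
```

Because the noise is orthogonal to every observed feature, adding it to the least-squares solution leaves the predictions on the training data unchanged.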

Formula formula_10: W h = Θ((d √ mH) H-h (d 3/2 + d √ mH

Formula formula_11: σ h = Θ((d √ mH) H-h+1 ( √ d + √ mH)), we have Reg T = Õ(d 5/2 H 5/2 + d 2 H 3/2 √ T ).

Formula formula_12: E[V ⋆ -V π ] ≤ ε.

Formula formula_13: argmin θ∈O(W ) g(θ) ∶= ∑ (ϕ(s,a),u)∈D (⟨θ, ϕ(s, a)⟩ -u) 2 where O(W ) = {θ ∈ R d | |⟨θ, ϕ(s, a)⟩| ≤ W } for some W ∈ R is a convex set, and D is a dataset of tuples {(ϕ(s, a), u)}. The oracle returns a point θ that satisfies g( θ) -min θ∈O(W ) g(θ) ≤ ε 2 1 and θ ∈ O(W + ε 2 ) where ε 1 , ε 2 ≤ 1 are precision parameters of the oracle.

Formula formula_14: 1 with σ R = Θ( √ dH) and σ h = Θ((d √ mH) H-h+1 (ε 1 γ √ H + √ d + √ mH), we have Reg T = Õ(d 5/2 H 5/2 + d 2 H 3/2 √ T + ε 1 γ(dH 2 + d 3/2 H √ T )).

Formula formula_15: T ∶ R d → R d so that, for all θ ∈ R d and all (s, a) ∈ S × A, it holds that |⟨T θ, ϕ(s, a)⟩ - Es ′ ∼T(s,a) max a ′ ⟨θ, ϕ(s ′ , a ′ )⟩| ≤ ε B . Moreover, we require that, for all h ∈ [H] and (s, a) ∈ S × A, the random reward is bounded in [0, 1] with |r h (s, a) -⟨ω ⋆ h , ϕ(s, a)⟩| ≤ ε B for some unknown ω ⋆ h ∈ R d .

Formula formula_16: σ R = Θ( √ dH + ε B HT ) and σ h = Θ((d √ mH) H-h+1 (ε B γ √ HT + √ ε B T + √ d + √ mH)), we have Reg T = Õ(d 5/2 H 5/2 + d 2 H 3/2 √ T + √ ε B (d 2 H 5/2 √ T + d 3/2 H 3/2 T ) + ε B γ(dH 2 √ T + d 3/2 HT )).

Formula formula_17: θt,h ← argmin θ∈O(W h ) t-1 ∑ i=1 (⟨θ, ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 ,(3)

Formula formula_18: W h = Θ((d √ mH) H-h (ε 1 dγ √ H + d 3/2 + d √ mH)).

Formula formula_19: K ∶= {θ ∈ R d ∶ (⟨θ, ϕ(s i,h , a i,h )⟩ − V t,h+1 (s i,h+1 )) 2 = 0 for all i ≤ t; |⟨θ, ϕ(s, a)⟩| ≤ W h for all s, a}. (4)

Formula formula_21: K APX ∶= {θ ∈ R d ∶ ⟨θ, ϕ(s i,h , a i,h )⟩ − V t,h+1 (s i,h+1 ) ≤ ε for all i ≤ t; ⟨θ, ϕ(s i,h , a i,h )⟩ − V t,h+1 (s i,h+1 ) ≥ −ε for all i ≤ t; |⟨θ, ϕ(s, a)⟩| ≤ W h + ε for all s, a}. (5)

Formula formula_22: (θ t,h + R ∞ (ε)) ⊆ K APX .

Formula formula_23: Assumption 7. Let Φ = {ϕ(s, a) | s, a ∈ S × A}. There exist some R ≥ 0 such that 1 R e i ∈ Φ

Formula formula_24: t-1 ∑ i=1 (⟨ θt,h , ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 ≤ T ε and θt,h ∈ O(W h + ε). Furthermore, Algorithm 4 takes O(d 7 log( R δε )) time in addition to O(d log( T HR δε )) calls to O lin .

Formula formula_25: Notation | Definition
O(W ) | {θ ∈ R d ∶ |⟨θ, ϕ(s, a)⟩| ≤ W for all s ∈ S, a ∈ A}
R ∞ (W ) | {θ ∈ R d ∶ ∥θ∥ ∞ ≤ W }
R 2 (W ) | {θ ∈ R d ∶ ∥θ∥ 2 ≤ W }
η t,h | T (ω t,h+1 + θ t,h+1 ) -θt,h
η R t,h | ω ⋆ h -ωt,h
ξ R t | ω t,h -ωt,h
ξ P t,h | θ t,h -θt,h
E high

Formula formula_26: E

Formula formula_27: Σ t,h ∑ m i=1 ρ i ϕ i ϕ ⊺ i + ∑ t-1 i=1 ϕ(s i,h , a i,h )ϕ ⊺ (s i,h , a i,h ) Σt,h ∑ t-1 i=1 ϕ(s i,h , a i,h )ϕ ⊺ (s i,h , a i,h ) W h

Formula formula_28: W h-1 = W h + 2ε 2 + √ 2d ⋅ B P noise,h + √ 2d ⋅ B R noise + 1 with W H+1 = 1

Formula formula_29: E span t ∶= {∀h ∈ [H] ∶ ϕ(s t,h , a t,h ) ∈ span(D t-1,h )} .(6)

Formula formula_30: ∑ t-1 i=1 (⟨ θt,h , ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 = 0.

Formula formula_31: ∑ T t=1 1{(E span t ) ∁ } ≤ dH.
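The bound on the number of rounds in which the span event fails follows from a dimension count: each failure at a step h adds a new direction to the span of the data at that step, which can happen at most d times per step. A minimal sketch for a single step h, with randomly generated features as an illustrative assumption and a hypothetical helper `in_span`:

```python
import numpy as np

def in_span(phi, data, tol=1e-9):
    """Check whether phi lies in the span of the previously observed features."""
    if len(data) == 0:
        return bool(np.linalg.norm(phi) <= tol)
    A = np.array(data)
    P = np.linalg.pinv(A) @ A             # projection onto the row space of A
    return bool(np.linalg.norm(phi - P @ phi) <= tol)

rng = np.random.default_rng(0)
d, T = 4, 200
data, misses = [], 0
for t in range(T):
    phi = rng.standard_normal(d)          # feature observed at round t (toy data)
    if not in_span(phi, data):
        misses += 1                       # span event fails at this round
    data.append(phi)
```

With H steps per episode the same count applies to each step separately, which is consistent with the dH bound stated above.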

Formula formula_32: V ⋆ (s t,1 ) -V πt (s t,1 ) = 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 )) + 1{(E span t ) ∁ }(V ⋆ (s t,1 ) -V πt (s t,1 ))

Formula formula_33: 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 )) = 1{E span t }(V ⋆ (s t,1 ) -U t (s t,1 )) ≤ V ⋆ (s t,1 ) -1{E span t }U t (s t,1 )

Formula formula_34: ≤ E Ṽt [ min{ Ṽt (s t,1 ), H} -1{E span t }U t (s t,1 ) | Ẽoptm t ]

Formula formula_35: = E Ṽt [1{ Ẽspan t }( min{ Ṽt (s t,1 ), H} -1{E span t }U t (s t,1 )) | Ẽoptm t ] + E Ṽt [1{( Ẽspan t ) ∁ }( min{ Ṽt (s t,1 ), H} -1{E span t }U t (s t,1 )) | Ẽoptm t ]

Formula formula_36: ≤ 1 Γ 2 (-1) E Ṽt [1{ Ẽspan t }( min{ Ṽt (s t,1 ), H} -1{E span t }U t (s t,1 ))] + 1 Γ 2 (-1) E Ṽt [1{( Ẽspan t ) ∁ }H]

Formula formula_37: = 1 Γ 2 (-1) E Ṽt [1{ Ẽspan t } min{ Ṽt (s t,1 ), H} -1{E span t }U t (s t,1 )] + 1 Γ 2 (-1) E Ṽt [1{( Ẽspan t ) ∁ ∩ E span t }U t (s t,1 )] + 1 Γ 2 (-1) E Ṽt [1{( Ẽspan t ) ∁ }H] ≤ 1 Γ 2 (-1) E Ṽt [1{ Ẽspan t } min{ Ṽt (s t,1 ), H} -1{E span t }U t (s t,1 )] + 2 Γ 2 (-1) E Ṽt [1{( Ẽspan t ) ∁ }H]

Formula formula_38: ≤ 1 Γ 2 (-1) E [1{E span t } min{V t (s t,1 ), H} -1{E span t }U t (s t,1 )] + 2 Γ 2 (-1) E [1{(E span t ) ∁ }H]

Formula formula_39: σ h ≥ √ H( √ 3γB P err + √ 8m(W h + ε 2 )), we have E [ T ∑ t=1 (V ⋆ (s t,1 ) -V πt (s t,1 ))] = Õ(d 5/2 H 5/2 + d 2 H 3/2 √ T + ε 1 γ(dH 2 + d 3/2 H √ T ) + √ ε B (d 2 H 5/2 √ T + d 3/2 H 3/2 T ) + ε B γ(dH 2 √ T + d 3/2 HT )).

Formula formula_40: W h-1 = W h + 2ε 2 + √ 2d ⋅ B P noise,h + √ 2d ⋅ B R noise + 1.

Formula formula_41: W h-1 ≈ d √ mH ⋅ W h + ε 1 ⋅ dγ √ H + ε B ⋅ dγ √ HT + √ ε B ⋅ d √ T + d 3/2 . (7)

Formula formula_43: W h ≈ (d √ mH) H+1-h + (d √ mH) H-h (ε 1 ⋅ dγ √ H + ε B ⋅ dγ √ HT + √ ε B ⋅ d √ T + d 3/2 ) ≈ (d √ mH) H-h (ε 1 ⋅ dγ √ H + ε B ⋅ dγ √ HT + √ ε B ⋅ d √ T + d 3/2 + d √ mH).

Formula formula_44: σ h ≈ (d √ mH) H-h+1 (ε 1 ⋅ γ √ H + ε B ⋅ γ √ HT + √ ε B ⋅ √ T + d 1/2 + √ mH).

Formula formula_45: σ R ≈ √ H( √ d log(HT ) + ε 1 + √ ε B T ). Define Λ = ∑ m i=1 ρ i ϕ i ϕ ⊺ i .

Formula formula_46: • λ ≤ √ d; • λ t,h ≤ √ 2d for all t ∈ [T ] and all h ∈ [H].

Formula formula_47: √ d.

Formula formula_48: x ⊺ Λx = m ∑ i=1 ρ i (x ⊺ ϕ i ) 2 = m ∑ i=1 ρ i (x ⊺ ϕ ∥ t,h,i + x ⊺ ϕ ⊥ t,h,i ) 2 ≤ 2 m ∑ i=1 ρ i (x ⊺ ϕ ∥ t,h,i ) 2 + 2 m ∑ i=1 ρ i (x ⊺ ϕ ⊥ t,h,i ) 2 (using (a + b) 2 ≤ 2a 2 + 2b 2 ) = 2x ⊺ Λ t,h x.
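The matrix inequality Λ ⪯ 2Λ t,h derived above can be checked numerically. The sketch below is an illustration with randomly generated design points, weights, and data (all assumptions), and takes P to be the projection onto the span of the observed features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 8, 2

design_phi = rng.standard_normal((m, d))   # toy design points phi_i
rho = rng.dirichlet(np.ones(m))            # toy design weights rho_i, summing to 1
data_phi = rng.standard_normal((n, d))     # toy observed features

P = np.linalg.pinv(data_phi) @ data_phi    # projection onto span of observed features
par = design_phi @ P                       # rows are the parallel parts P phi_i
perp = design_phi - par                    # rows are the orthogonal parts (I - P) phi_i

w = rho[:, None]
Lam = (design_phi * w).T @ design_phi
Lam_th = (par * w).T @ par + (perp * w).T @ perp

diff = 2 * Lam_th - Lam
min_eig = float(np.linalg.eigvalsh((diff + diff.T) / 2).min())
```

The difference 2Λ t,h − Λ equals ∑ i ρ i (ϕ ∥ i − ϕ ⊥ i )(ϕ ∥ i − ϕ ⊥ i ) ⊺, so its smallest eigenvalue is non-negative up to round-off.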

Formula formula_49: ∥ω t,h -ω ⋆ h ∥ Σt ≤ √ 1030(1 + ε 2 ) 4 d log (8(1 + ε 2 )e 2 T 2 H/δ) + 4ε 2 1 + 16(1 + ε 2 )(1 + ε B T ).

Formula formula_50: ω ← argmin ω∈O(1) n ∑ i=1 (ω ⊺ ϕ i -r i ) 2

Formula formula_51: ω i = (ω ⊺ ϕ i -r i ) 2 -((ω ⋆ ) ⊺ ϕ i -r i ) 2 . Then we have |z ω i | ≤ 4(1 + ε 2 ) 2

Formula formula_52: E i [z ω i ] = E i [(ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i )(ω ⊺ ϕ i + (ω ⋆ ) ⊺ ϕ i -2r i )] = E i [(ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i + 2((ω ⋆ ) ⊺ ϕ i -r i ))] ≥ (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 -4(1 + ε 2 )ε B ,

Formula formula_53: E i [(z ω i ) 2 ] = E i [(ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 (ω ⊺ ϕ i + (ω ⋆ ) ⊺ ϕ i -2r i ) 2 ] ≤ 16(1 + ε 2 ) 2 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2

Formula formula_54: |z ω i -Ei z ω i | ≤ 8(1 + ε 2 ) 2

Formula formula_55: n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 - n ∑ i=1 z ω i ≤ η n ∑ i=1 16(1 + ε 2 ) 2 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 + 8(1 + ε 2 ) 2 log(|C|/δ) η + 4(1 + ε 2 )ε B T.(9)

Formula formula_56: ∑ n i=1 |ω ⊺ ϕ i -ω⊺ ϕ i | ≤ nα.

Formula formula_57: n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 2 n ∑ i=1 (ω ⊺ ϕ i -ω⊺ ϕ i ) 2 + 2 n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 2n 2 α 2 + 2 n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 , n ∑ i=1 z ω i - n ∑ i=1 z ω i = n ∑ i=1 (ω ⊺ ϕ i -ω⊺ ϕ i )(ω ⊺ ϕ i + ω⊺ ϕ i -2r i ) ≤ 4(1 + ε 2 )nα.

Formula formula_58: n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 1 1 -16(1 + ε 2 ) 2 η n ∑ i=1 z ω i + 8(1 + ε 2 ) 2 η(1 -16(1 + ε 2 ) 2 η) ⋅ log(|C|/δ) + 4(1 + ε 2 )ε B T 1 -16(1 + ε 2 ) 2 η . Setting η = (32(1 + ε 2 ) 2 ) -1 , we get n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 2 n ∑ i=1 z ω i + 512(1 + ε 2 ) 4 log(|C|/δ) + 8(1 + ε 2 )ε B T.
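The constants obtained after setting η = (32(1 + ε 2 ) 2 ) -1 can be verified with exact rational arithmetic. In the sketch below, the variable `c` stands for (1 + ε 2 ) 2 and the function name is hypothetical; the three returned values are the coefficients of ∑ z, of log(|C|/δ), and of (1 + ε 2 )ε B T.

```python
from fractions import Fraction

def constants_after_setting_eta(c):
    """Coefficients of the bound after substituting eta = 1 / (32 c), with c = (1 + eps_2)^2."""
    eta = Fraction(1, 32) / c
    denom = 1 - 16 * c * eta             # equals 1/2 for every c > 0
    coeff_z = 1 / denom                  # multiplies the sum of z terms
    coeff_log = 8 * c / (eta * denom)    # multiplies log(|C| / delta)
    coeff_bias = 4 / denom               # multiplies (1 + eps_2) * eps_B * T
    return coeff_z, coeff_log, coeff_bias

c = Fraction(9, 4)                       # corresponds to eps_2 = 1/2
coeff_z, coeff_log, coeff_bias = constants_after_setting_eta(c)
```

The values match the second display: 2, 512(1 + ε 2 ) 4, and 8.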

Formula formula_59: n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 2n 2 α 2 + 4 n ∑ i=1 z ω i + 1024(1 + ε 2 ) 4 log(|C|/δ) + 16(1 + ε 2 )ε B T. ≤ 2n 2 α 2 + 4 n ∑ i=1 z ω i + 1024(1 + ε 2 ) 4 log(|C|/δ) + 16(1 + ε 2 )nα + 16(1 + ε 2 )ε B T.

Formula formula_60: n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 2n 2 α 2 + 4ε 2 1 + 1024(1 + ε 2 ) 4 log(|C|/δ) + 16(1 + ε 2 )(nα + ε B T ).

Formula formula_61: n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ 2 + 4ε 2 1 + 1024(1 + ε 2 ) 4 d log(8(1 + ε 2 )e 2 n/δ) + 16(1 + ε 2 )(1 + ε B T ) ≤ 1026(1 + ε 2 ) 4 d log(8(1 + ε 2 )e 2 n/δ) + 4ε 2 1 + 16(1 + ε 2 )(1 + ε B T ). Finally, we have ∥ω -ω ⋆ h ∥ 2 Σt = n ∑ i=1 (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 + m ∑ i=1 ρ i (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 .

Formula formula_62: m ∑ i=1 ρ i (ω ⊺ ϕ i -(ω ⋆ ) ⊺ ϕ i ) 2 ≤ m ∑ i=1 ρ i ⋅ 4(1 + ε 2 ) = 4(1 + ε 2 ).

Formula formula_63: t-1 ∑ i=1 (⟨ θt,h , ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 ≤ ε 2 1 + T ε 2 B . Furthermore, ∥ θt,h -T (ω t,h + θ t,h+1 )∥ Σt,h ≤ √ 2ε 2 1 + 4T ε 2 B =∶ B P err .

Formula formula_64: ∈ [t -1] ∶ |⟨ϕ(s i,h , a i,h ), T (ω t,h + θ t,h+1 )⟩ -V t,h+1 (s i,h+1 )| ≤ ε B .

Formula formula_65: t-1 ∑ i=1 (⟨ θt,h , ϕ(s i,h , a i,h )⟩ -V t,h+1 (s i,h+1 )) 2 ≤ ε 2 1 + T ε 2 B .

Formula formula_66: t-1 ∑ i=1 ⟨ϕ(s i,h , a i,h ), θt,h -T (ω t,h + θ t,h+1 )⟩ 2 ≤ 2 t-1 ∑ i=1 (⟨ϕ(s i,h , a i,h ), θt,h ⟩ -V t,h+1 (s i,h+1 )) 2 + 2 t-1 ∑ i=1 (V t,h+1 (s i,h+1 ) -⟨ϕ(s i,h , a i,h ), T (ω t,h + θ t,h+1 )⟩) 2 (using (a + b) 2 ≤ 2a 2 + 2b 2 ) ≤ 2ε 2 1 + 4T ε 2 B . This implies that ∥ θt,h -T (ω t,h + θ t,h+1 )∥ 2 Σt,h ≤ 2ε 2 1 + 4T ε 2 B .

Formula formula_67: E high ∶= {∀t ∈ [T ], ∀h ∈ [H] ∶ ∥ξ P t,h ∥ Λ t,h ≤ σ h √ 2d log(6dH 2 T 2 ) =∶ B P noise,h } ∩ {∀t ∈ [T ], ∀h ∈ [H] ∶ ∥ξ R t,h ∥ Σ t,h ≤ σ R √ 2d log(6dHT 2 ) =∶ B R noise } ∩ {∀t ∈ [T ], ∀h ∈ [H] ∶ ∥η R t,h ∥ Σ t,h ≤ B R err } where B R err ∶= √ 1030(1 + ε 2 ) 4 d log (24(1 + ε 2 )e 2 T 3 H 2 ) + 4ε 2 1 + 16(1 + ε 2 )(1 + ε B T ).

Formula formula_68: Pr (∀t ∈ [T ], ∀h ∈ [H] ∶ ∥ζ t,h ∥ Λ t,h > σ h √ 2d log(6dH 2 T 2 )) ≤ 1/(3HT ).

Formula formula_69: ∥ξ P t,h ∥ 2 Λ t,h = ∥(I -P t,h )ζ t,h ∥ 2 Λ t,h = ζ ⊺ t,h (I -P t,h ) m ∑ i=1 ρ i (ϕ ∥ t,h,i (ϕ ∥ t,h,i ) ⊺ + ϕ ⊥ t,h,i (ϕ ⊥ t,h,i ) ⊺ )(I -P t,h )ζ t,h = ζ ⊺ t,h m ∑ i=1 ρ i ϕ ⊥ t,h,i (ϕ ⊥ t,h,i ) ⊺ ζ t,h ≤ ζ ⊺ t,h m ∑ i=1 ρ i (ϕ ∥ t,h,i (ϕ ∥ t,h,i ) ⊺ + ϕ ⊥ t,h,i (ϕ ⊥ t,h,i ) ⊺ )ζ t,h

Formula formula_70: Pr (∀t ∈ [T ] ∶ ∥ξ R t ∥ Σt > σ R √ 2d log(6dHT 2 )) ≤ 1/(3HT ).

Formula formula_71: 1. max s,a |⟨ϕ(s, a), θt,h ⟩| ≤ W h + ε 2 ; 2. max s,a |⟨ϕ(s, a), T (ω t,h + θ t,h+1 )⟩| ≤ W h ; 3. ∥η t,h ∥ Σt,h ≤ B P err ; 4. ∥η t,h ∥ Λ ≤ 2(W h + ε 2 ) √ m; 5. ∥η t,h ∥ Λ t,h ≤ √ 3γB P err + √ 8m(W h + ε 2 ) ; 6. max s,a |⟨ϕ(s, a), θ t,h ⟩| ≤ W h-1 - √ 2d ⋅ B R noise -1 -ε 2 7. max s V t,h (s) = max s,a |Q t,h (s, a)| ≤ W h-1 .

Formula formula_72: |⟨ϕ(s, a), T (ω t,h + θ t,h+1 )⟩| = | E s ′ ∼T(s,a) max a ′ ⟨ϕ(s ′ , a ′ ), ω t,h + θ t,h+1 ⟩| ≤ max s,a |⟨ϕ(s, a), ω t,h + θ t,h+1 ⟩| ≤ max s,a |⟨ϕ(s, a), ωt,h ⟩| + max s,a |⟨ϕ(s, a), ξ R t,h ⟩| + max s,a |⟨ϕ(s, a), θ t,h+1 ⟩| ≤ (1 + ε 2 ) + max s,a ∥ϕ(s, a)∥ Σ -1 t,h ∥ξ R t,h ∥ Σ t,h + (W h - √ 2d ⋅ B R noise -1 -ε 2 ) ≤ 1 + ε 2 + √ 2d ⋅ B R noise + (W h - √ 2d ⋅ B R noise -1 -ε 2 ) = W h .

Formula formula_73: ∥η t,h ∥ Λ = ∥ θt,h -T (ω t,h + θ t,h+1 )∥ Λ ≤ ∥ θt,h ∥ Λ + ∥T (ω t,h + θ t,h+1 )∥ Λ ≤ 2(W h + ε 2 ) √ m.

Formula formula_74: ∥ θt,h ∥ Λ = m ∑ i=1 ⟨ϕ i , θt,h ⟩ 2 ≤ m ∑ i=1 (W h + ε 2 ) 2 = (W h + ε 2 ) √ m

Formula formula_75: ∥η t,h ∥ 2 Λ t,h = m ∑ i=1 ρ i (⟨ϕ ∥ t,h,i , η t,h ⟩ 2 + ⟨ϕ ⊥ t,h,i , η t,h ⟩ 2 ) = m ∑ i=1 ρ i (⟨P t,h ϕ i , η t,h ⟩ 2 + ⟨(I -P t,h )ϕ i , η t,h ⟩ 2 ) ≤ m ∑ i=1 ρ i (3⟨P t,h ϕ i , η t,h ⟩ 2 + 2⟨ϕ i , η t,h ⟩ 2 ) (using (a + b) 2 ≤ 2a 2 + 2b 2 ) ≤ 3 m ∑ i=1 ρ i ( ∥ϕ ∥ t,h,i ∥ 2 Σ † t,h ∥η t,h ∥ 2 Σt,h ) + 2∥η t,h ∥ 2 Λ (Cauchy-Schwarz, Lemma 25) We have ∥ϕ ∥ t,h,i ∥ Σ † t,h = ∥P t,h ϕ i ∥ Σ † t,h = ∥ϕ i ∥ Σ †

Formula formula_76: ∥η t,h ∥ 2 Λ t,h ≤ 3γ 2 (B P err ) 2 + 2∥η t,h ∥ 2 Λ ≤ 3γ 2 (B P err ) 2 + 8(W h + ε 2 ) 2 m. (Item 4)

Formula formula_77: max s,a |⟨ϕ(s, a), θ t,h ⟩| = max s,a |⟨ϕ(s, a), θt,h + ξ P t,h ⟩| ≤ max s,a |⟨ϕ(s, a), θt,h ⟩| + max s,a |⟨ϕ(s, a), ξ P t,h ⟩| ≤ W h + ε 2 + max s,a ∥ϕ(s, a)∥ Λ -1 t,h ∥ξ P t,h ∥ Λ t,h ≤ W h + ε 2 + √ 2d ⋅ B P noise,h (Lemma 5) = W h-1 - √ 2d ⋅ B R noise -1 -ε 2 .

Formula formula_78: |Q t,h (s, a)| = |⟨ϕ(s, a), θ t,h ⟩ + ⟨ϕ(s, a), ω t,h ⟩| ≤ |⟨ϕ(s, a), θ t,h ⟩| + |⟨ϕ(s, a), ωt,h ⟩| + |⟨ϕ(s, a), ξ R t ⟩| ≤ (W h-1 - √ 2d ⋅ B R noise -1 -ε 2 ) + (1 + ε 2 ) + √ 2d ⋅ B R noise = W h-1 .

Formula formula_79: s |V t,h (s)| = max s,a |Q t,h (s, a)| ≤ W h-1 .

Formula formula_80: V t,

Formula formula_81: s.t. ∀h ∈ [H] ∶ ∥ ξP t,h ∥ Λ t,h ≤ B P noise,h , ∥ ξR t,h ∥ Σ t,h ≤ B R noise .

Formula formula_82: V πt (s t,1 ) -V t (s t,1 ) = H ∑ h=1 (V t,h+1 (s t,h+1 ) -⟨θ t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s t,h , a t,h )⟩); (10) V ⋆ (s t,1 ) -V t (s t,1 ) ≤ H ∑ h=1 (V t,h+1 (s ⋆ t,h+1 ) -⟨θ t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩).

Formula formula_83: V πt (s t,1 ) -U t (s t,1 ) ≤ H ∑ h=1 (U t,h+1 (s t,h+1 ) -⟨θ t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s t,h , a t,h )⟩). (12)

Formula formula_85: V π (s ′ t,1 ) -V t (s ′ t,1 ) = Q π 1 (s ′ t,1 , π(s ′ t,1 )) -max a Q t,1 (s ′ t,1 , a) ≤ Q π 1 (s ′ t,1 , π(s ′ t,1 )) -Q t,1 (s ′ t,1 , π(s ′ t,1 )) (13) = V π 2 (s ′ t,2 ) + r h (s ′ t,1 , a ′ t,1 ) -⟨θ t,1 , ϕ(s ′ t,1 , π(s ′ t,1 ))⟩ -⟨ω t,h , ϕ(s ′ t,1 , π(s ′ t,1 ))⟩ (by definition) = (V π 2 (s ′ t,2 ) -V t,2 (s ′ t,2 )) + (V t,2 (s ′ t,2 ) -⟨θ t,1 , ϕ(s ′ t,1 , π(s ′ t,1 ))⟩) + ⟨ω ⋆ h -ω t,h , ϕ(s ′ t,1 , a ′ t,1 )⟩

Formula formula_86: V π (s ′ t,1 ) -V t (s ′ t,1 ) ≤ H ∑ h=1 (V t,h+1 (s ′ t,h+1 ) -⟨θ t,h , ϕ(s ′ t,h , a ′ t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s ′ t,h , a ′ t,h )⟩).

Formula formula_87: V πt (s t,1 ) -U t (s t,1 ) = Q πt 1 (s t,1 , π t (s t,1 )) -max a Q t,1 (s t,1 , a) ≤ Q πt 1 (s t,1 , π t (s t,1 )) -Q t,1 (s t,1 , π t (s t,1 )) = V πt 2 (s t,2 ) + r h (s t,1 , a t,1 ) -⟨θ t,1 , ϕ(s t,1 , π t (s t,1 ))⟩ -⟨ω t,h , ϕ(s t,1 , a t,1 )⟩ (by definition) = (V πt 2 (s t,2 ) -U t,2 (s t,2 )) + (U t,2 (s t,2 ) -⟨θ t,1 , ϕ(s t,1 , π t (s t,1 ))⟩) + ⟨ω ⋆ h -ω t,h , ϕ(s t,1 , a t,1 )⟩

Formula formula_88: V πt (s t,1 ) -U t (s t,1 ) ≤ H ∑ h=1 (U t,h+1 (s t,h+1 ) -⟨θ t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s t,h , a t,h )⟩).

Formula formula_89: V t (s t,1 ) = H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1

Formula formula_90: V t (s t,1 ) -V πt (s t,1 ) = H ∑ h=1 (⟨ θt,h , ϕ(s t,h , a t,h )⟩ + ⟨ξ P t,h , ϕ(s t,h , a t,h )⟩ -V t,h+1 (s t,h+1 ) + ⟨ω t,h -ω ⋆ h , ϕ(s t,h , a t,h )⟩)

Formula formula_91: V t (s t,1 ) -V πt (s t,1 ) = H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩ + ⟨ξ P t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω t,h -ω ⋆ h , ϕ(s t,h , a t,h )⟩).

Formula formula_92: V t (s t,1 ) = H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ) + ξ P t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω t,h , ϕ(s t,h , a t,h )⟩).

Formula formula_93: V πt (s t,1 ) -U t (s t,1 ) ≤ H ∑ h=1 (U t,h+1 (s t,h+1 ) -⟨θ t,h , ϕ(s t,h , a t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s t,h , a t,h )⟩)

Formula formula_94: U t (s t,1 ) ≥ H ∑ h=1 (⟨ω t,h , ϕ(s t,h , a t,h )⟩ + ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ) + ξ P t,h , ϕ(s t,h , a t,h )⟩).

Formula formula_95: T ∑ t=1 (V t (s t,1 ) -V πt (s t,1 )) ≤ B P err γH + (B R noise + B R err ) ⋅ B R ϕ .

Formula formula_96: T ∑ t=1 (V t (s t,1 ) -V πt (s t,1 )) = T ∑ t=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩ + ⟨ω t,h -ω ⋆ h , ϕ(s t,h , a t,h )⟩)

Formula formula_97: ≤ T ∑ t=1 (∥ θt,h -T (θ t,h+1 + ω t,h+1 )∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h + ∥ω t,h -ω ⋆ h ∥ Σ t,h , ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h

Formula formula_98: ≤ H ⋅ B P err γ + (B R noise + B R err ) ⋅ B R ϕ .

Formula formula_99: |U t (s t,1 )| ≤ H ⋅ (B R noise + B R err ) ⋅ √ d + H ⋅ (1 + B P err γ). Moreover, we have |V t (s t,1 )| ≤ H ⋅ (B R noise + B R err ) ⋅ √ d + H ⋅ (1 + B P err γ).

Formula formula_100: B V ∶= H ⋅ (B R noise + B R err ) ⋅ √ d + H ⋅ (1 + B P err γ).

Formula formula_101: |V t (s t,1 )| ≤ | H ∑ h=1 ⟨ω t,h , ϕ(s t,h , a t,h )⟩| + | H ∑ h=1 ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩| =∶ T 1 + T 2 .

Formula formula_102: T 1 = | H ∑ h=1 ⟨(ω t,h -ωt,h ) + (ω t,h -ω ⋆ ) + ω ⋆ h , ϕ(s t,h , a t,h )⟩| ≤ H ∑ h=1 (∥ω t,h -ωt,h ∥ Σ t,h + ∥ω t,h -ω ⋆ h ∥ Σ t,h ) ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h + V πt (Cauchy-Schwartz) ≤ H ⋅ (B R noise + B R err ) ⋅ √ d + H.

Formula formula_103: T 2 = | H ∑ h=1 ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩| ≤ H ∑ h=1 ∥ θt,h -T (θ t,h+1 + ω t,h+1 )∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h

Formula formula_104: ≤ B P err γH.

Formula formula_105: U t (s t,1 ) ≥ H ∑ h=1 (⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩ + ⟨ω t,h , ϕ(s t,h , a t,h )⟩) (Lemma 11) ≥ -B P err γH -| H ∑ h=1 ⟨(ω t,h -ωt,h ) + (ω t,h -ω ⋆ h ) + ω ⋆ h , ϕ(s t,h , a t,h )⟩|

Formula formula_106: ≥ -B P err γH - H ∑ h=1 (∥ω t,h -ωt,h ∥ Σ t,h + ∥ω t,h -ω ⋆ h ∥ Σ t,h ) ∥ϕ(s t,h , a t,h )∥ Σ -1 t,h (Cauchy-Schwartz) ≥ -B P err γH -H ⋅ (B R noise + B R err ) ⋅ √ d. (Lemma 8)

Formula formula_107: U t (s t,1 ) ≤ E[V t (s t,1 ) | E high ]

Formula formula_108: ≤ B P err γH + H ⋅ (B R noise + B R err ) ⋅ √ d + H.

Formula formula_109: Pr (E optm t ) ≥ Γ 2 (-1)

Formula formula_110: V ⋆ (s t,1 ) -V t (s t,1 ) ≤ H ∑ h=1 (V t,h+1 (s ⋆ t,h+1 ) -⟨θ t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ + ⟨ω ⋆ h -ω t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩) = H ∑ h=1 (V t,h+1 (s ⋆ t,h+1 ) -⟨ θt,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩) (i) - H ∑ h=1 ⟨ξ P t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ + H ∑ h=1 ⟨ω ⋆ h -ωt,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ (iii) - H ∑ h=1 ⟨ξ R t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ (iv)

Formula formula_111: V t,h+1 (s ′ ) -⟨ θt,h , ϕ(s, a)⟩ = ⟨T (ω t,h+1 + θ t,h+1 ) -θt,h , ϕ(s, a)⟩ = ⟨η t,h , ϕ(s, a)⟩.

Formula formula_112: (i) -(ii) ≤ H ∑ h=1 ⟨η t,h -ξ P t,h , ϕ(s ⋆ t,h , a ⋆ t,h )⟩ =∶ H ∑ h=1 ⟨η t,h -ξ P t,h , ϕ ⋆ h ⟩

Formula formula_113: (i) -(ii) ≤ H ∑ h=1 ⟨η t,h , P t,h ϕ ⋆ h ⟩ + H ∑ h=1 ⟨η t,h , (I -P t,h )ϕ ⋆ h ⟩ - H ∑ h=1 ⟨ξ P t,h , ϕ ⋆ h ⟩ ≤ H ∑ h=1 ∥η t,h ∥ Σt,h ∥P t,h ϕ ⋆ h ∥ Σ † t,h + H ∑ h=1 ∥η t,h ∥ Λ t,h ∥(I -P t,h )ϕ ⋆ h ∥ Λ -1 t,h - H ∑ h=1 ⟨ξ P t,h , ϕ ⋆ h ⟩ (Cauchy-Schwartz, Lemma 25) ≤ B P err γH + H ∑ h=1 ∥η t,h ∥ Λ t,h ∥(I -P t,h )ϕ ⋆ h ∥ Λ -1 t,h - H ∑ h=1 ⟨ξ P t,h , ϕ ⋆ h ⟩

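Formula formula_114: $\le B^{P}_{\mathrm{err}}\gamma H + \sqrt{H \sum_{h=1}^{H} \|\eta_{t,h}\|^2_{\Lambda_{t,h}} \, \|(I - P_{t,h})\phi^{\star}_h\|^2_{\Lambda^{-1}_{t,h}}} - \sum_{h=1}^{H} \langle \xi^{P}_{t,h}, \phi^{\star}_h \rangle$ (Cauchy-Schwarz)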

Formula formula_115: H ∑ h=1 ⟨ξ P t,h , ϕ ⋆ h ⟩ ∼ N (0, H ∑ h=1 σ 2 h ∥(I -P t,h )ϕ ⋆ h ∥ 2 Λ -1 t,h

Formula formula_116: Pr ((i) -(ii) ≤ B P err γH) ≥ Γ(-1).

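Formula formula_117:
$$\begin{aligned}
\text{(iii)} - \text{(iv)} &= \sum_{h=1}^{H}\langle\omega^\star_h-\hat\omega_{t,h},\phi^\star_h\rangle - \sum_{h=1}^{H}\langle\xi^{R}_{t,h},\phi^\star_h\rangle\\
&\le \sum_{h=1}^{H}\|\omega^\star_h-\hat\omega_{t,h}\|_{\Sigma_{t,h}}\|\phi^\star_h\|_{\Sigma^{-1}_{t,h}} - \sum_{h=1}^{H}\langle\xi^{R}_{t,h},\phi^\star_h\rangle\\
&\le \sqrt{H\cdot\sum_{h=1}^{H}\|\omega^\star_h-\hat\omega_{t,h}\|^2_{\Sigma_{t,h}}\|\phi^\star_h\|^2_{\Sigma^{-1}_{t,h}}} - \sum_{h=1}^{H}\langle\xi^{R}_{t,h},\phi^\star_h\rangle.
\end{aligned}$$
Recall that $\xi^{R}_{t,h}$ is sampled from $\mathcal{N}(0,\sigma_R^2\Sigma^{-1}_{t,h})$

Formula formula_118: $$\sum_{h=1}^{H}\langle\xi^{R}_{t,h},\phi^\star_h\rangle \sim \mathcal{N}\left(0,\ \sum_{h=1}^{H}\sigma_R^2\,\|\phi^\star_h\|^2_{\Sigma^{-1}_{t,h}}\right)$$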

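Formula formula_119: $$\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\} \le dH.$$

Formula formula_120: $$\sum_{t=1}^{T}\|\phi(s_{t,h},a_{t,h})\|_{\Sigma^{-1}_{t,h}} \le d\sqrt{2T\log(T+1)} =: B^{R}_{\phi}, \qquad \sum_{t=1}^{T}\mathbf{1}\{E^{\mathrm{span}}_t\}\|\phi(s_{t,h},a_{t,h})\|_{\Sigma^{\dagger}_{t,h}} \le \gamma d\sqrt{2dT\log(2T\gamma^2)} =: B^{P}_{\phi}.$$

Formula formula_121: $$\|\phi(s_{t,h},a_{t,h})\|_{\Sigma^{-1}_{t,h}} \le \|\phi(s_{t,h},a_{t,h})\|_{\Lambda^{-1}} \le \sqrt{d}.$$

Formula formula_122:
$$\begin{aligned}
\sum_{t=1}^{T}\|\phi(s_{t,h},a_{t,h})\|_{\Sigma^{-1}_{t,h}} &\le \sqrt{T\cdot\sum_{t=1}^{T}\|\phi(s_{t,h},a_{t,h})\|^2_{\Sigma^{-1}_{t,h}}} = \sqrt{T\cdot\sum_{t=1}^{T}\min\left\{\|\phi(s_{t,h},a_{t,h})\|^2_{\Sigma^{-1}_{t,h}},\, d\right\}}\\
&\le \sqrt{Td\cdot\sum_{t=1}^{T}\min\left\{\|\phi(s_{t,h},a_{t,h})\|^2_{\Sigma^{-1}_{t,h}},\, 1\right\}}\\
&\le \sqrt{Td\cdot 2d\log(T+1)} && \text{(elliptical potential lemma, Lemma 21)}\\
&= d\sqrt{2T\log(T+1)}.
\end{aligned}$$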

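Formula formula_123:
$$\begin{aligned}
\|\phi(s_{t,h},a_{t,h})\|^2_{\Sigma^{\dagger}_{t,h}} &= \phi(s_{t,h},a_{t,h})^{\top}\Sigma^{\dagger}_{t,h}\,\phi(s_{t,h},a_{t,h})\\
&= \phi(s_{t,h},a_{t,h})^{\top}\left(\Sigma_{t_i,h} + \sum_{j=t_i}^{t-1}\phi(s_{j,h},a_{j,h})\phi^{\top}(s_{j,h},a_{j,h})\right)^{\dagger}\phi(s_{t,h},a_{t,h})\\
&= x_t^{\top}U^{\top}\left(UDU^{\top} + \sum_{j=t_i}^{t-1}Ux_jx_j^{\top}U^{\top}\right)^{\dagger}Ux_t = x_t^{\top}\left(D + \sum_{j=t_i}^{t-1}x_jx_j^{\top}\right)^{-1}x_t.
\end{aligned}$$
Define $D_t = D + \sum_{j=t_i}^{t-1}x_jx_j^{\top}$. Hence, we have
$$\sum_{t=t_i}^{t_{i+1}-1}\mathbf{1}\{E^{\mathrm{span}}_t\}\|\phi(s_{t,h},a_{t,h})\|_{\Sigma^{\dagger}_{t,h}} = \sum_{t=t_i}^{t_{i+1}-1}\mathbf{1}\{E^{\mathrm{span}}_t\}\|x_t\|_{D_t^{-1}}.$$

Formula formula_124:
$$\begin{aligned}
\sum_{t=t_i}^{t_{i+1}-1}\mathbf{1}\{E^{\mathrm{span}}_t\}\|x_t\|_{D_t^{-1}} &\le \sqrt{T\cdot\sum_{t=t_i}^{t_{i+1}-1}\mathbf{1}\{E^{\mathrm{span}}_t\}\|x_t\|^2_{D_t^{-1}}} = \sqrt{T\cdot\sum_{t=t_i}^{t_{i+1}-1}\mathbf{1}\{E^{\mathrm{span}}_t\}\min\left\{\|x_t\|^2_{D_t^{-1}},\,\gamma^2\right\}}\\
&\le \gamma\sqrt{T\cdot\sum_{t=t_i}^{t_{i+1}-1}\mathbf{1}\{E^{\mathrm{span}}_t\}\min\left\{\|x_t\|^2_{D_t^{-1}},\,1\right\}}\\
&\le \gamma\sqrt{T\cdot 2d\log\left(T\gamma^2(1+1/d)\right)} && \text{(elliptical potential lemma, Lemma 21)}\\
&\le \gamma\sqrt{T\cdot 2d\log(2T\gamma^2)}.
\end{aligned}$$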

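Formula formula_125:
$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right] &\le \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbf{1}\{E^{\mathrm{span}}_t\}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]\\
&\quad + \mathbb{E}\left[\mathbf{1}\{(E^{\mathrm{high}})^{\complement}\}\sum_{t=1}^{T}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]\\
&\quad + \mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]
\end{aligned}$$

Formula formula_126:
$$\begin{aligned}
&\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbf{1}\{E^{\mathrm{span}}_t\}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]\\
&\quad\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1})\right]\right]\\
&\qquad + \frac{1}{\Gamma^2(-1)}\cdot\left(dHB_V + B^{P}_{\mathrm{err}}\gamma H + (B^{R}_{\mathrm{noise}}+B^{R}_{\mathrm{err}})\cdot B^{R}_{\phi} + dH^2 + 1\right)
\end{aligned}$$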

Formula formula_127: E [1{E high } T ∑ t=1 1{E span t }(V ⋆ (s t,1 ) -V πt (s t,1 ))] ≤ E [1{E high } T ∑ t=1 (V ⋆ (s t,

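Formula formula_128:
$$\begin{aligned}
&\le \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\left(\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\}V^{\pi_t}(s_{t,1})\right) \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right]\\
&\quad + \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{high}})^{\complement}\}\left(\min\{H,\tilde V_t(s_{t,1})\} - \mathbf{1}\{E^{\mathrm{span}}_t\}V^{\pi_t}(s_{t,1})\right) \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right]\\
&\quad + \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{span}}_t)^{\complement}\}\left(\min\{H,\tilde V_t(s_{t,1})\} - \mathbf{1}\{E^{\mathrm{span}}_t\}V^{\pi_t}(s_{t,1})\right) \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right] + B^{P}_{\mathrm{err}}\gamma H\\
&=: T_1 + T_2 + T_3 + B^{P}_{\mathrm{err}}\gamma H
\end{aligned}$$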

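Formula formula_129:
$$\begin{aligned}
T_1 &= \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\Big[\underbrace{\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\left(\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\}U_t(s_{t,1})\right) + \mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\cdot B_V}_{(*)} \,\Big|\, \tilde E^{\mathrm{optm}}_t\Big]\right]\\
&\quad + \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\mathbf{1}\{E^{\mathrm{span}}_t\}\left(U_t(s_{t,1}) - V^{\pi_t}(s_{t,1})\right) \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right]\\
&\quad - \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\cdot B_V \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right]\\
&=: T_{1.1} + T_{1.2} + T_{1.3}.
\end{aligned}$$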

Formula formula_130: $$\mathbb{E}[X \mid E] = \mathbb{E}[X\cdot\mathbf{1}\{E\}]/\Pr(E) \le \mathbb{E}[X]/\Pr(E)$$
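The last inequality in formula_130 holds when $X \ge 0$, since only then does dropping the indicator not decrease the expectation. The following is a minimal sketch that checks both steps exactly on a toy discrete distribution; the function name `conditional_bounds` and the two distributions are hypothetical and chosen purely for illustration.

```python
from fractions import Fraction

def conditional_bounds(outcomes):
    """outcomes: list of (probability, value of X, whether the event E holds).

    Returns (E[X | E], E[X * 1{E}] / Pr(E), E[X] / Pr(E)) as exact fractions.
    """
    pr_e = sum(p for p, _, in_e in outcomes if in_e)
    # E[X | E] computed from the renormalized conditional distribution.
    cond = sum((p / pr_e) * x for p, x, in_e in outcomes if in_e)
    restricted = sum(p * x for p, x, in_e in outcomes if in_e) / pr_e
    relaxed = sum(p * x for p, x, _ in outcomes) / pr_e
    return cond, restricted, relaxed

# Hypothetical toy distribution with a nonnegative X.
nonneg = [
    (Fraction(1, 2), Fraction(2), False),
    (Fraction(1, 4), Fraction(3), True),
    (Fraction(1, 4), Fraction(1), True),
]
# Same distribution, but X is negative outside E: the last inequality fails.
signed = [(Fraction(1, 2), Fraction(-4), False)] + nonneg[1:]

print(conditional_bounds(nonneg))
print(conditional_bounds(signed))
```

On the nonnegative example the conditional expectation is 2 and the relaxed bound is 4. On the signed example the relaxed quantity drops to -2, below the conditional expectation, so nonnegativity of the integrand matters.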

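Formula formula_131:
$$\begin{aligned}
T_{1.1} &\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\left(\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\}U_t(s_{t,1})\right) + \mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\cdot B_V\right]\right]\\
&= \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\left(\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\}U_t(s_{t,1})\right)\right]\right] + \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\cdot B_V\right]\\
&\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\left(\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\}U_t(s_{t,1})\right)\right]\right] + \frac{1}{\Gamma^2(-1)}\cdot dHB_V && \text{(Lemma 15)}
\end{aligned}$$

Formula formula_132:
$$\begin{aligned}
T_{1.2} &\le \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\mathbf{1}\{E^{\mathrm{span}}_t\}\left(V_t(s_{t,1}) - V^{\pi_t}(s_{t,1})\right) \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right] && (V_t \ge U_t \text{ conditioning on } E^{\mathrm{high}})\\
&\le B^{P}_{\mathrm{err}}\gamma H + (B^{R}_{\mathrm{noise}}+B^{R}_{\mathrm{err}})\cdot B^{R}_{\phi}.
\end{aligned}$$

Formula formula_133:
$$\begin{aligned}
T_1 &\le \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\left(\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\}U_t(s_{t,1})\right)\right]\right] + \frac{1}{\Gamma^2(-1)}\cdot dHB_V + B^{P}_{\mathrm{err}}\gamma H + (B^{R}_{\mathrm{noise}}+B^{R}_{\mathrm{err}})\cdot B^{R}_{\phi}\\
&= \frac{1}{\Gamma^2(-1)}\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1})\right]\right] + \frac{1}{\Gamma^2(-1)}\cdot dHB_V + B^{P}_{\mathrm{err}}\gamma H + (B^{R}_{\mathrm{noise}}+B^{R}_{\mathrm{err}})\cdot B^{R}_{\phi}
\end{aligned}$$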

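Formula formula_134:
$$\begin{aligned}
T_2 &\le H\cdot\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{high}})^{\complement}\} \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right]\\
&\le H\cdot\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{high}})^{\complement}\} \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right] && \text{(dropping } E^{\mathrm{high}})\\
&= H\cdot\mathbb{E}\left[\sum_{t=1}^{T}\Pr\left((\tilde E^{\mathrm{high}})^{\complement}\cap\tilde E^{\mathrm{optm}}_t\right)/\Pr\left(\tilde E^{\mathrm{optm}}_t\right)\right]\\
&\le \frac{HT}{\Gamma^2(-1)}\cdot\Pr\left((E^{\mathrm{high}})^{\complement}\right) \le \frac{1}{\Gamma^2(-1)}
\end{aligned}$$

Formula formula_135:
$$\begin{aligned}
T_3 &\le H\cdot\mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{span}}_t)^{\complement}\} \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right]\\
&\le H\cdot\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{span}}_t)^{\complement}\} \,\Big|\, \tilde E^{\mathrm{optm}}_t\right]\right] && \text{(dropping } E^{\mathrm{high}})\\
&= H\cdot\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{span}}_t)^{\complement}\cap\tilde E^{\mathrm{optm}}_t\}\right]/\Pr(\tilde E^{\mathrm{optm}}_t)\right]\\
&\le H\cdot\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{span}}_t)^{\complement}\}\right]/\Pr(\tilde E^{\mathrm{optm}}_t)\right]\\
&\le \frac{H}{\Gamma^2(-1)}\cdot\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\right] && \text{(tower rule)}\\
&\le \frac{dH^2}{\Gamma^2(-1)} && \text{(Lemma 15)}
\end{aligned}$$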

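Formula formula_136:
$$\begin{aligned}
&\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1})\right]\right]\\
&\quad\le \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}U_t(s_{t,1})\right]\right] + dHB_V + 2B_V/H.
\end{aligned}$$

Formula formula_137:
$$\begin{aligned}
&\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1})\right]\right]\\
&\quad= \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1})\right]\right]\\
&\qquad - \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap(E^{\mathrm{high}})^{\complement}\}\tilde V_t(s_{t,1})\right]\right]
\end{aligned}$$

Formula formula_138: $$-\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap(E^{\mathrm{high}})^{\complement}\}\tilde V_t(s_{t,1})\right]\right] \le \mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{high}})^{\complement}\}B_V\right] \le B_V/H$$

Formula formula_139:
$$\begin{aligned}
\mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1}) &\ge \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}U_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{high}})^{\complement}\}U_t(s_{t,1})\\
&\quad - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{span}}_t)^{\complement}\}U_t(s_{t,1}).
\end{aligned}$$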

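Formula formula_140:
$$\begin{aligned}
&\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}U_t(s_{t,1})\right]\right]\\
&\quad\le \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}U_t(s_{t,1})\right]\right]\\
&\qquad + \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{high}})^{\complement}\}U_t(s_{t,1})\right]\right]\\
&\qquad + \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{span}}_t)^{\complement}\}U_t(s_{t,1})\right]\right] + B_V/H
\end{aligned}$$

Formula formula_141:
$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{high}})^{\complement}\}U_t(s_{t,1})\right]\right] &\le \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{high}})^{\complement}\}B_V\right]\right] && \text{(Lemma 13)}\\
&\le T\cdot\Pr\left((\tilde E^{\mathrm{high}})^{\complement}\right)B_V \le B_V/H && \text{(Lemma 8)}
\end{aligned}$$

Formula formula_142:
$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\cap(\tilde E^{\mathrm{span}}_t)^{\complement}\}U_t(s_{t,1})\right]\right] &\le \mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{(\tilde E^{\mathrm{span}}_t)^{\complement}\}\right]B_V\right] && \text{(Lemma 13)}\\
&= B_V\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\right] && \text{(tower rule)}\\
&\le dHB_V. && \text{(Lemma 15)}
\end{aligned}$$

Formula formula_143: $$\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}U_t(s_{t,1})\right]\right] \le 2HB^{P}_{\mathrm{err}}B^{P}_{\phi} + 2(B^{R}_{\mathrm{err}}+B^{R}_{\mathrm{noise}})\cdot HB^{R}_{\phi}.$$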

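Formula formula_144: $$\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{\tilde V_t}\left[\mathbf{1}\{\tilde E^{\mathrm{high}}\cap\tilde E^{\mathrm{span}}_t\}\tilde V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}U_t(s_{t,1})\right]\right] = \mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{E^{\mathrm{high}}\cap E^{\mathrm{span}}_t\}V_t(s_{t,1}) - \mathbf{1}\{E^{\mathrm{span}}_t\cap E^{\mathrm{high}}\}U_t(s_{t,1})\right]$$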

Formula formula_145: ≤ E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ⟨ θt,h -T (θ t,h+1 + ω t,h+1 ), ϕ(s t,h , a t,h )⟩] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ⟨T (θ t,h+1 + ω t,h+1 ) -θt,h , ϕ(s t,h , a t,h )⟩] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ⟨ω t,h -ω t,h , ϕ(s t,h , a t,h )⟩]

Formula formula_146: ≤ E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ∥ θt,h -T (θ t,h+1 + ω t,h+1 )∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h ] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 ∥T (θ t,h+1 + ω t,h+1 ) -θt,h ∥ Σt,h ∥ϕ(s t,h , a t,h )∥ Σ † t,h ] + E [ T ∑ t=1 1{E high ∩ E span t } H ∑ h=1 (∥ω t,h -ω ⋆ h ∥ Σ t,h + ∥ω ⋆ h -ω t,h ∥ Σ t,h )∥ϕ(s t,h , a t,h )∥ Σ -1 t,h ]

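Formula formula_147: $$\|\omega_{t,h}-\omega^\star_h\|_{\Sigma_{t,h}} \le \|\omega_{t,h}-\hat\omega_{t,h}\|_{\Sigma_{t,h}} + \|\hat\omega_{t,h}-\omega^\star_h\|_{\Sigma_{t,h}} \le B^{R}_{\mathrm{err}} + B^{R}_{\mathrm{noise}}$$

Formula formula_148: $$\sum_{t=1}^{T}\sum_{h=1}^{H}\|\phi(s_{t,h},a_{t,h})\|_{\Sigma^{-1}_{t,h}} \le HB^{R}_{\phi}.$$

Formula formula_149: $$2HB^{P}_{\mathrm{err}}B^{P}_{\phi} + 2(B^{R}_{\mathrm{err}}+B^{R}_{\mathrm{noise}})\cdot HB^{R}_{\phi}.$$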

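Formula formula_150:
$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right] &\le \mathbb{E}\left[\mathbf{1}\{E^{\mathrm{high}}\}\sum_{t=1}^{T}\mathbf{1}\{E^{\mathrm{span}}_t\}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]\\
&\quad + \mathbb{E}\left[\mathbf{1}\{(E^{\mathrm{high}})^{\complement}\}\sum_{t=1}^{T}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]\\
&\quad + \mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\left(V^\star(s_{t,1}) - V^{\pi_t}(s_{t,1})\right)\right]\\
&=: T_A + T_B + T_C.
\end{aligned}$$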

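Formula formula_151:
$$\begin{aligned}
T_A &\le \frac{1}{\Gamma^2(-1)}\cdot\Big(2B_V(dH + 1/H) + HB^{P}_{\mathrm{err}}\gamma + dH^2 + 1 + (B^{R}_{\mathrm{err}}+B^{R}_{\mathrm{noise}})(2H+1)B^{R}_{\phi} + 2HB^{P}_{\mathrm{err}}B^{P}_{\phi}\Big)\\
&= \tilde O\Big(d^{5/2}H^{5/2} + d^{2}H^{3/2}\sqrt{T} + \varepsilon_1\gamma\left(dH^2 + d^{3/2}H\sqrt{T}\right) + \sqrt{\varepsilon_B}\left(d^{2}H^{5/2}\sqrt{T} + d^{3/2}H^{3/2}T\right) + \varepsilon_B\gamma\left(dH^2\sqrt{T} + d^{3/2}HT\right)\Big)
\end{aligned}$$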

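Formula formula_152: $$T_B \le HT\cdot\Pr\left((E^{\mathrm{high}})^{\complement}\right) \le 1.$$

Formula formula_153: $$T_C \le H\cdot\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}\{(E^{\mathrm{span}}_t)^{\complement}\}\right] \le dH^2.$$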

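Formula formula_154: $$\sum_{t=1}^{T}\min\left\{1,\ x_t^{\top}\Sigma_t^{-1}x_t\right\} \le 2d\log\left(\frac{b}{a} + \frac{T}{ad}\right).$$

Formula formula_155: $$\sum_{t=1}^{T}\min\left\{1,\ x_t^{\top}\Sigma_t^{-1}x_t\right\} \le 2d\log(T+1).$$

Formula formula_156: $$\min\left\{1,\ x_t^{\top}\Sigma_t^{-1}x_t\right\} \le 2x_t^{\top}\Sigma_{t+1}^{-1}x_t \tag{14}$$

Formula formula_157:
$$\begin{aligned}
x_t^{\top}\Sigma_{t+1}^{-1}x_t &= x_t^{\top}\left(\Sigma_t + x_tx_t^{\top}\right)^{-1}x_t = x_t^{\top}\left(\Sigma_t^{-1} - \frac{\Sigma_t^{-1}x_tx_t^{\top}\Sigma_t^{-1}}{1+\|x_t\|^2_{\Sigma_t^{-1}}}\right)x_t\\
&= \|x_t\|^2_{\Sigma_t^{-1}} - \frac{\|x_t\|^4_{\Sigma_t^{-1}}}{1+\|x_t\|^2_{\Sigma_t^{-1}}} = \frac{\|x_t\|^2_{\Sigma_t^{-1}}}{1+\|x_t\|^2_{\Sigma_t^{-1}}}.
\end{aligned}$$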
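The rank-one inverse identity in formula_157 is easy to check numerically. The sketch below uses only the Python standard library; the helper names (`solve`, `dot`, `both_sides`), the dimension, and the random positive definite matrix are all assumptions made for illustration and do not come from the paper.

```python
import random

def solve(a, b):
    """Solve a @ z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def both_sides(sigma, x):
    """Return (x^T (Sigma + x x^T)^{-1} x,  u / (1 + u)) with u = x^T Sigma^{-1} x."""
    d = len(x)
    sigma_next = [[sigma[i][j] + x[i] * x[j] for j in range(d)] for i in range(d)]
    lhs = dot(x, solve(sigma_next, x))
    u = dot(x, solve(sigma, x))
    return lhs, u / (1.0 + u)

random.seed(0)
d = 4
g = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
# Sigma = I + G G^T is symmetric positive definite (illustrative choice).
sigma = [[(1.0 if i == j else 0.0) + dot(g[i], g[j]) for j in range(d)] for i in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]
lhs, rhs = both_sides(sigma, x)
print(abs(lhs - rhs))
```

The two quantities should agree up to floating-point error, which is the content of the last equality in formula_157.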

Formula formula_158: $\|x_t\|^2_{\Sigma_t^{-1}}/2$. Case 2: $x_t^{\top}\Sigma_t^{-1}x_t \ge 1$.

Formula formula_159: $$x_t^{\top}\Sigma_{t+1}^{-1}x_t \ge \min\left\{1,\ x_t^{\top}\Sigma_t^{-1}x_t\right\}/2$$
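The case analysis leading to formula_159 can be written out as a short worked derivation. It uses the shorthand $u := \|x_t\|^2_{\Sigma_t^{-1}} = x_t^{\top}\Sigma_t^{-1}x_t$, which is introduced here only for readability, together with the rank-one identity $x_t^{\top}\Sigma_{t+1}^{-1}x_t = u/(1+u)$.

```latex
\begin{align*}
\text{Case 1 } (u < 1):\quad
  & \frac{u}{1+u} \;\ge\; \frac{u}{2} \;=\; \frac{\min\{1, u\}}{2}
  && \text{since } 1 + u < 2, \\
\text{Case 2 } (u \ge 1):\quad
  & \frac{u}{1+u} \;=\; 1 - \frac{1}{1+u} \;\ge\; \frac{1}{2} \;=\; \frac{\min\{1, u\}}{2}
  && \text{since } 1 + u \ge 2.
\end{align*}
```

In both cases $x_t^{\top}\Sigma_{t+1}^{-1}x_t \ge \min\{1, u\}/2$, which rearranges to inequality (14).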

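Formula formula_160: $$\sum_{t=1}^{T}x_t^{\top}\Sigma_{t+1}^{-1}x_t = \sum_{t=1}^{T}\operatorname{tr}\left(\Sigma_{t+1}^{-1}(\Sigma_{t+1}-\Sigma_t)\right) \le \sum_{t=1}^{T}\left(\log\det\Sigma_{t+1} - \log\det\Sigma_t\right) = \log\left(\frac{\det\Sigma_{T+1}}{\det\Sigma_1}\right)$$

Formula formula_161: $$\det\left(\Sigma_1 + \sum_{t=1}^{T}x_tx_t^{\top}\right) \le \prod_{i=1}^{d}(b+\lambda_i) \le \left(\frac{1}{d}\sum_{i=1}^{d}(b+\lambda_i)\right)^{d} \le \left(b + \frac{1}{d}\operatorname{tr}\left(\sum_{t=1}^{T}x_tx_t^{\top}\right)\right)^{d} \le \left(b + \frac{T}{d}\right)^{d}$$

Formula formula_162: $$\log\left(\frac{\det\Sigma_{T+1}}{\det\Sigma_1}\right) \le d\log\left(\frac{b}{a} + \frac{T}{ad}\right).$$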
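The chain from formula_156 to formula_162 can be sanity-checked with a small simulation. The sketch below assumes $\Sigma_1 = I$ (so $a = b = 1$) and unit-norm vectors $x_t$; these choices, the function names, and the values of `d` and `T` are illustrative assumptions rather than the paper's setup. The inverse is maintained with the same rank-one update that appears in formula_157.

```python
import math
import random

def elliptical_potential(xs, d):
    """Return sum_t min{1, x_t^T Sigma_t^{-1} x_t} with Sigma_1 = I,
    keeping Sigma_t^{-1} up to date via rank-one (Sherman-Morrison) updates."""
    inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    total = 0.0
    for x in xs:
        # v = Sigma_t^{-1} x_t and u = x_t^T Sigma_t^{-1} x_t
        v = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        u = sum(xi * vi for xi, vi in zip(x, v))
        total += min(1.0, u)
        # Sigma_{t+1}^{-1} = Sigma_t^{-1} - v v^T / (1 + u)
        for i in range(d):
            for j in range(d):
                inv[i][j] -= v[i] * v[j] / (1.0 + u)
    return total

def unit_vector(rng, d):
    """Sample a direction uniformly on the unit sphere."""
    g = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g]

rng = random.Random(0)
d, T = 5, 2000
xs = [unit_vector(rng, d) for _ in range(T)]
potential = elliptical_potential(xs, d)
bound = 2 * d * math.log(1 + T / d)  # 2 d log(b/a + T/(a d)) with a = b = 1
print(potential, bound)
```

Under these assumptions the simulated potential should stay below the logarithmic bound for any sequence of vectors with norm at most one, not only random ones.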

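Formula formula_163: $$\Sigma_{T+1} = \mathbb{E}_{x\sim\rho}\,xx^{\top} + \sum_{t=1}^{T}x_tx_t^{\top} = (T+1)\underbrace{\left(\frac{1}{1+T}\cdot\mathbb{E}_{x\sim\rho}\,xx^{\top} + \sum_{t=1}^{T}\frac{1}{1+T}\cdot x_tx_t^{\top}\right)}_{(*)} =: (T+1)\,\mathbb{E}_{x\sim\rho'}\,xx^{\top}$$

Formula formula_164: $$\log\left(\frac{\det\Sigma_{T+1}}{\det\Sigma_1}\right) = \log\left(\frac{(T+1)^{d}\det\mathbb{E}_{x\sim\rho'}\,xx^{\top}}{\det\Sigma_1}\right) \le \log\left((T+1)^{d}\right) = d\log(T+1).$$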

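Formula formula_165: $h' \in C$ such that $\sum_{i=1}^{n}|h(z_i)-h'(z_i)|/n \le \varepsilon$. We define $N(\varepsilon,\mathcal{H},n) = \max_{Z^n\in\mathcal{Z}^n}N(\varepsilon,\mathcal{H},Z^n)$.

Formula formula_166: $$\mathcal{H}^{+} = \{(x,\xi)\mapsto\mathbf{1}[h(x) > \xi] : h\in\mathcal{H}\} \subseteq (\mathcal{X}\times\mathbb{R}\to\{0,1\})$$

Formula formula_167: $$N(\varepsilon,\mathcal{H},n) \le \left(4e^2(b-a)/\varepsilon\right)^{d}.$$

Formula formula_168: $$J(\pi) = \mathbb{E}\left[\sum_{h=1}^{H}x_h^{\top}Qx_h + u_h^{\top}Ru_h\right]$$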

Formula formula_170: Q(x, u) = [ x u

Formula formula_171: $$Q(x,u) = \mathbb{E}_{x'}\left[\min_{u'}Q(x',u')\right] \tag{17}$$

Formula formula_174: min u ′ Q(x ′ , u ′ ) = [Ax + Bu + w]

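Formula formula_175: $$\sum_{i=1}^{t-1}\left(\langle\omega,\phi(s_{i,h},a_{i,h})\rangle - r_{i,h}\right)^2 \le \Delta + \varepsilon.$$

Formula formula_176: $$K^{\Delta}_{\mathrm{APX}} := \left\{\omega\in\mathbb{R}^{d} \ \middle|\ \begin{array}{l}\sum_{i=1}^{t-1}\left(\langle\omega,\phi(s_i,a_i)\rangle - r_i\right)^2 \le \Delta + \varepsilon\\ |\langle\omega,\phi(s,a)\rangle| \le 1+\varepsilon \ \text{ for all } s,a\end{array}\right\} \tag{29}$$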
