['13,15c13', '< Computational Efficient RL under Linear Bellman Completeness. Numerous works have focused on computationally efficient RL within the scope of linear Bellman completeness (LBC). The simplest setting is tabular MDPs where computationally efficient and near-optimal algorithms have been well known (Azar et al., 2017;Zhang et al., 2020;Jin et al., 2018). Tabular MDPs can be extended to linear MDPs (Jin et al., 2020), where computationally efficient algorithms are also known (Jin et al., 2020;Agarwal et al., 2023;He et al., 2023). However, in the setting of linear Bellman completeness, which captures linear MDPs, the existence of computationally efficient algorithms remain unclear. Previous works have resorted to various assumptions to achieve computational efficiency, such as few actions (Golowich & Moitra, 2024) and assuming MDPs are "explorable" (Zanette et al., 2020c). We provide a detailed overview of the literature in Section 3.2.', '< Exploration via Randomization. Random noise has been a powerful alternative to bonus-based exploration in RL literature. A typical approach is Randomized Least-Squares Value Iteration (RLSVI) (Osband et al., 2016), which injects Gaussian noise into the least-squares estimate and achieves near-optimal worst-case regret for linear MDPs (Agrawal et al., 2021;Zanette et al., 2020a); Ishfaq et al. (2023) instead propose posterior sampling via Langevin Monte Carlo for Q-function and also obtain regret bounds for linear MDPs; Ishfaq et al. (2021) developed randomization algorithms for general function approximation assuming bounded eluder dimension and Bellman completeness for any function. Randomization is also explored in preference-based RL, leading to the first computationally efficient algorithm with near-optimal regret guarantees for linear MDPs (Wu & Sun, 2024). 
However, these approaches either have strong assumptions (e.g., Bellman completeness for any function), or inject random noise larger than the estimation error, causing exponential blowup of parameter values-to mitigate it, they truncate the value, but this is feasible only in low-rank MDPs and challenging under linear Bellman completeness as the Bellman backup of truncated value may no longer be linear. Consequently, existing algorithms cannot handle linear Bellman complete problems, and new techniques capable of managing exponential parameter values are needed.', '< Beyond Linear Bellman Completeness. Many structural conditions capture linear Bellman completeness, such as Bilinear class (Du et al., 2021), Bellman eluder dimension (Jin et al., 2021), Bellman rank (Jiang et al., 2017), witness rank (Sun et al., 2019), and decision-estimation coefficient (Foster et al., 2021). While statistically efficient algorithms exist for these settings, no computationally efficient algorithms are known.', '---', '> Our work builds upon and differentiates itself from several lines of research in computationally efficient Reinforcement Learning (RL) and exploration strategies. We categorize the most relevant prior efforts below.', '16a15,20', '> Computationally Efficient RL under Linear Bellman Completeness. The pursuit of computationally efficient RL algorithms within the linear Bellman completeness (LBC) setting has been a significant area of focus. For basic tabular MDPs, efficient and near-optimal algorithms are well-established (Azar et al., 2017; Zhang et al., 2020; Jin et al., 2018). These results extend to linear MDPs (Jin et al., 2020), where computationally efficient algorithms, such as LSVI-UCB, are also known (Jin et al., 2020; Agarwal et al., 2023; He et al., 2023). 
However, the broader LBC setting, which subsumes linear MDPs but lacks certain restrictive assumptions (e.g., on Bellman operator norm-boundedness or feature representation properties), presents a greater challenge. The existence of computationally efficient algorithms for general LBC has remained an open question. Previous works have often relied on strong simplifying assumptions to achieve computational efficiency, such as a limited number of actions (Golowich & Moitra, 2024) or "explorable" MDPs (Zanette et al., 2020c), which do not hold universally. Our approach specifically targets the LBC setting with deterministic dynamics but without these strong assumptions, addressing a previously unmet need. We provide a more detailed overview of the literature and its limitations in Section 3.2.', '> ', '> Exploration via Randomization. Random noise injection has emerged as a powerful alternative to bonus-based exploration in RL. Randomized Least-Squares Value Iteration (RLSVI) (Osband et al., 2016) is a prominent example, injecting Gaussian noise into least-squares estimates to achieve near-optimal worst-case regret for linear MDPs (Agrawal et al., 2021; Zanette et al., 2020a). Other randomization techniques include posterior sampling via Langevin Monte Carlo for Q-functions (Ishfaq et al., 2023) and methods for general function approximation with bounded eluder dimension and Bellman completeness (Ishfaq et al., 2021). Randomization is also explored in preference-based RL (Wu & Sun, 2024). However, a critical limitation of these prior randomization methods in the LBC setting is their tendency to cause exponential blow-up of parameter values. This occurs because the noise injected is often larger than the estimation error, and without strong norm-boundedness assumptions (which we do not make), the parameters can grow unboundedly with the horizon. 
While truncation is a common mitigation strategy, it is only feasible in low-rank MDPs and fundamentally incompatible with linear Bellman completeness, as the Bellman backup of a truncated value function is generally no longer linear. Our work introduces a novel null-space randomization technique that explicitly addresses this parameter blow-up issue, making randomization viable for LBC problems.', '> ', '> Beyond Linear Bellman Completeness. The LBC setting itself is a specific structural condition. Broader structural conditions like Bilinear classes (Du et al., 2021), Bellman eluder dimension (Jin et al., 2021), Bellman rank (Jiang et al., 2017), witness rank (Sun et al., 2019), and decision-estimation coefficients (Foster et al., 2021) have also been explored. While these conditions have led to statistically efficient algorithms, computationally efficient counterparts often remain unknown, further highlighting the general challenge of bridging the statistical-computational gap in RL with rich function approximation.', '> ', '46,52c50,58', '< In this section, we review prior efforts on RL under linear Bellman completeness and discuss various assumptions underlying these approaches.', '< Efficient Algorithms under Generative Access. A generative model takes as input a state-action pair (s, a) and returns a sample s ′ ∼ T(⋅ | s, a) and the reward signal. With such a generative model, Linear Least-Squares Value Iteration (LSVI) can achieve statistical and computational efficiency (Agarwal et al., 2019). However, generative access is a big assumption, and our work aims to operate with only online access.', '< Efficient Algorithms under Explorability Assumption. Zanette et al. (2020c) propose a rewardfree algorithm under the assumption that every direction in the parameter space is reachable. This assumption, when translated into tabular MDPs, means that any state can be reached with a probability bounded below by some (large enough) positive constant. 
This does not hold if there are unreachable states or if the probability of reaching them is exponentially small. Computationally Intractable Algorithms. Zanette et al. (2020b) present a computationally intractable algorithm that requires solving an intractable optimization problem. In our work, we aim to only utilize a tractable squared loss minimization oracle.', '< Few action MDPs. Golowich & Moitra (2024) propose a computationally efficient algorithm under linear Bellman completeness, inspired by the bonus-based exploration approach in LSVI-UCB (Jin et al., 2020) for Linear MDPs. While their algorithm extends to stochastic MDPs, both the sample complexity and running time have exponential dependence on the size of the action space. In comparison, our algorithm extends to infinite action spaces but relies on the transition dynamics to be deterministic.', '< Deterministic Rewards or Deterministic Initial State. Several existing studies provide computationally and statistically efficient algorithms for more general settings but under stronger assumptions; these methods can be extended to linear Bellman completeness settings but similarly strong assumptions will also apply. Du et al. (2020) provide an algorithm based on a span argument that is efficient for MDPs that have linear optimal state-action value function (a.k.a. the Linear Q ⋆ setting), deterministic transition dynamics, deterministic initial state, and stochastic rewards. Unfortunately, their approach cannot extend to settings with stochastic initial states, as we consider in our paper. Another line of work due to Wen & Van Roy (2017) considers the Q ⋆ -realizable setting with deterministic dynamics, deterministic rewards, stochastic initial states, and bounded eluder dimension. Their approach can be extended to the linear bellman completeness setting when both rewards and dynamics are deterministic. 
However, their algorithm fails to converge when rewards are stochastic and thus may not apply to the problem setting that we consider.', '< Efficient Algorithm in the hybrid RL setting. Song et al. (2022) develop efficient algorithms for the hybrid RL setting, where the learner has access to both online interaction and an offline dataset. However, they do not have a fully online algorithms.', '< In summary, no previous work addresses the problem with stochastic initial states, stochastic rewards, and large action spaces. This is the gap that we aim to fill with this work.', '---', '> This section provides a comprehensive review of prior research on RL within the linear Bellman completeness framework, critically examining the assumptions and limitations of existing approaches. Our analysis highlights the specific challenges that our proposed algorithm overcomes.', '> ', "> Efficient Algorithms under Generative Access. Algorithms that assume access to a generative model, which can provide samples of the next state (s' ~ T(⋅ | s, a)) and reward signal for any given state-action pair (s, a), have achieved statistical and computational efficiency. Linear Least-Squares Value Iteration (LSVI) is a prime example within this category (Agarwal et al., 2019). However, the assumption of generative access is often unrealistic in practical online RL scenarios. Our work specifically focuses on the more challenging online access setting, where the agent interacts directly with the environment.", '> ', '> Efficient Algorithms under Explorability Assumption. Some prior works, such as Zanette et al. (2020c), propose reward-free algorithms contingent on an "explorability" assumption. This assumption posits that every direction in the parameter space is reachable, which, in tabular MDPs, implies that any state can be reached with a sufficiently high probability. 
This condition is restrictive and does not hold in environments with unreachable states or where certain states are reached only with exponentially small probability. Our algorithm operates without such strong assumptions on environmental explorability.', '> ', '> Computationally Intractable Algorithms. Zanette et al. (2020b) present an algorithm that is computationally intractable because it requires solving an intractable optimization problem. A core design principle of our work is to rely solely on a tractable squared loss minimization oracle, ensuring computational feasibility.', '> ', '> Few-Action MDPs. Golowich & Moitra (2024) introduced a computationally efficient algorithm for linear Bellman completeness, extending bonus-based exploration from LSVI-UCB (Jin et al., 2020) for Linear MDPs. While their method handles stochastic MDPs, both its sample complexity and running time exhibit an exponential dependence on the size of the action space. This makes it impractical for problems with large or continuous action spaces. In contrast, our algorithm is designed to scale efficiently to infinite action spaces, albeit under the assumption of deterministic transition dynamics.', '53a60,65', '> Deterministic Rewards or Deterministic Initial State. Several studies have developed computationally and statistically efficient algorithms for more general settings by imposing strong assumptions on either the reward function or the initial state distribution. Du et al. (2020) present an efficient algorithm for the Linear Q⋆ setting with deterministic transitions, deterministic initial states, and stochastic rewards, leveraging a span argument. However, their approach cannot be directly extended to scenarios with stochastic initial states, which are explicitly considered in our paper. 
Another line of work by Wen & Van Roy (2017) addresses the Q⋆-realizable setting with deterministic dynamics, deterministic rewards, stochastic initial states, and bounded eluder dimension. Their approach extends to linear Bellman completeness when both rewards and dynamics are deterministic, but the algorithm fails to converge when rewards are stochastic, so it may not apply to our problem setting.', '> ', '> Efficient Algorithms in Hybrid RL. Song et al. (2022) develop efficient algorithms for the hybrid RL setting, where the learner has access to both online interaction and an offline dataset. However, their work does not provide a fully online algorithm, which is the focus of our research.', '> ', '> In summary, despite significant prior work, a computationally efficient online RL algorithm for the linear Bellman complete setting that simultaneously accommodates stochastic initial states, stochastic rewards, and large (or infinite) action spaces, while only requiring deterministic transition dynamics, has remained an open problem. This paper addresses this gap.', '> ', '1033d1044', '< ']
