Multi-Level Regression for Nonlinear Contextual Bandits and RL: Second-order and Horizon-free Regret Bounds

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Bandit, Reinforcement Learning
Abstract: Recent works have established second-order regret bounds for nonlinear contextual bandits. However, these results exhibit a suboptimal dependence on the complexity of the function class. To close this gap, we propose a novel algorithm featuring a multi-level regression structure. This method partitions the data by uncertainty and variance and performs a separate regression on each level, enabling adaptive, instance-dependent learning. Our method achieves a tight second-order regret bound of $\tilde{O}\Big(\sqrt{d_\mathcal{F} \log N_\mathcal{F} \sum_{t\in[T]} \sigma_t^2} + R d_\mathcal{F} \log N_\mathcal{F}\Big)$, which matches the theoretical lower bound. Here, $d_\mathcal{F}$ and $\log N_\mathcal{F}$ denote the Eluder dimension and log-covering number of the reward function class $\mathcal{F}$, $\sigma_t^2$ is the unknown variance of the reward at round $t$, and $R$ is the range of the rewards. The proposed algorithm is computationally efficient assuming access to a regression oracle. We further extend our framework to model-based reinforcement learning, achieving a regret bound that is both second-order and horizon-free. The underlying multi-level regression technique is of independent interest and applicable to a broad range of online decision-making problems.
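To give a concrete picture of the level-partitioning step described in the abstract, here is a minimal illustrative sketch in Python. It assumes access to a generic regression oracle `fit_oracle(X, y)` over the function class $\mathcal{F}$, and all names (`multi_level_regression`, `num_levels`, the geometric thresholding rule) are hypothetical choices made for this sketch rather than the paper's actual algorithm or interface.

```python
import numpy as np

def multi_level_regression(contexts, rewards, variances, uncertainties,
                           fit_oracle, num_levels=4):
    """Illustrative sketch only: partition observed samples into levels by the
    magnitude of their variance/uncertainty proxies, then run a separate
    regression on each level.

    Assumptions (not from the paper): proxies are normalized to (0, 1], and
    `fit_oracle(X, y)` is any regression oracle returning a fitted predictor.
    """
    # Assign each sample a level via geometric thresholds: level k roughly
    # collects samples whose proxy lies in (2^{-(k+1)}, 2^{-k}].
    scores = np.maximum(variances, uncertainties)
    levels = np.clip(np.floor(-np.log2(scores + 1e-12)).astype(int),
                     0, num_levels - 1)

    models = {}
    for k in range(num_levels):
        mask = levels == k
        if mask.any():
            # Separate regression restricted to the data assigned to level k,
            # so low-variance rounds do not get drowned out by noisy ones.
            models[k] = fit_oracle(contexts[mask], rewards[mask])
    return models
```

The design choice illustrated here is that grouping rounds with comparable variance before regressing lets each fitted predictor enjoy error bounds scaling with that group's variance, which is the mechanism behind the second-order (variance-dependent) regret in the abstract.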
Primary Area: reinforcement learning
Submission Number: 8535