Anti-Concentrated Confidence Bonuses For Scalable Exploration

Jordan T. Ash; Cyril Zhang; Surbhi Goel; Akshay Krishnamurthy; Sham M. Kakade

Published: 28 Jan 2022, Last Modified: 22 Jun 2025 | ICLR 2022 Poster | Readers: Everyone

Keywords: deep reinforcement learning, reinforcement learning, bandits, exploration

Abstract: Intrinsic rewards play a central role in handling the exploration-exploitation tradeoff when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature, prescribes an elliptical bonus that addresses the challenge of leveraging shared information in large action spaces. This bonus scheme cannot be directly transferred to high-dimensional exploration problems, however, due to the computational cost of maintaining the inverse covariance matrix of action features. We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, using an ensemble of regressors trained to predict random noise from policy network-derived features. Using this approximation, we obtain stochastic linear bandit algorithms that achieve $\tilde O(d \sqrt{T})$ regret bounds for $\mathsf{poly}(d)$ fixed actions. We also develop a practical variant that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
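The core idea in the abstract, approximating the elliptical bonus with regressors fit to random noise, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the closed-form ridge solve, the Gaussian noise targets, and the mean-of-squares aggregation are all assumptions chosen for illustration, and the paper's practical variant works with policy network-derived features rather than a fixed feature matrix. The sketch relies on the fact that a ridge regressor fit to unit-variance noise has prediction variance $x^\top \Sigma^{-1} X^\top X \Sigma^{-1} x$ with $\Sigma = X^\top X + \lambda I$, which is close to the squared elliptical bonus $x^\top \Sigma^{-1} x$ when $\lambda$ is small.

```python
import numpy as np

def elliptical_bonus(X, x, lam=1e-3):
    """Exact squared elliptical bonus x^T (X^T X + lam*I)^{-1} x."""
    d = X.shape[1]
    cov = X.T @ X + lam * np.eye(d)
    return float(x @ np.linalg.solve(cov, x))

def noise_ensemble(X, n_members=2000, lam=1e-3, seed=0):
    """Fit n_members ridge regressors to i.i.d. standard normal targets.

    Returns a (d, n_members) weight matrix, one column per regressor.
    The closed-form solve is an illustrative shortcut (assumption).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Y = rng.standard_normal((n, n_members))  # random noise targets
    cov = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(cov, X.T @ Y)

def ensemble_bonus(W, x):
    """Mean squared ensemble prediction at x (hypothetical aggregation)."""
    return float(np.mean((x @ W) ** 2))

# Small demonstration on synthetic features (hypothetical data).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))   # previously seen action features
x = rng.standard_normal(5)          # query action
W = noise_ensemble(X)
print("exact bonus:   ", elliptical_bonus(X, x))
print("ensemble bonus:", ensemble_bonus(W, x))
```

The appeal of the ensemble view is that each regressor can be updated incrementally with ordinary regression steps, so no inverse covariance matrix has to be stored or maintained; the closed-form solve above is used only to keep the example short.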

Supplementary Material: zip

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/anti-concentrated-confidence-bonuses-for/code)
