Harnessing Bayesian Optimism with Dual Policies in Reinforcement Learning

ICLR 2026 Conference Submission 17364 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · CC BY 4.0
Keywords: reinforcement learning, exploration-exploitation trade-off
TL;DR: We propose to use two policies to address the exploration-exploitation trade-off by leveraging Bayesian principles.
Abstract: Deep reinforcement learning (RL) algorithms for continuous control tasks often struggle with the trade-off between exploration and exploitation. The exploitation objective of an RL policy is to approximate the optimal strategy that maximises the expected cumulative return under its current beliefs about the environment. At the same time, the policy must also explore to gather new samples, which are essential for refining the underlying function approximators. Contemporary RL algorithms often entrust a single policy with both behaviours. Yet these two behaviours are not always aligned; tasking a single policy with this dual mandate may force a suboptimal compromise, resulting in inefficient exploration or hesitant exploitation. Whilst state-of-the-art methods focus on alleviating this trade-off to prevent catastrophic failures, they may inadvertently sacrifice the benefits of the optimism that drives exploration. To address this challenge, we propose a new algorithm that trains two distinct policies to disentangle exploration and exploitation in continuous control, aiming to strike a balance between robust exploration and effective exploitation. The first policy explores the environment optimistically, maximising an upper confidence bound (UCB) of the expected return, with the uncertainty estimates for the bound derived from an approximate Bayesian framework. Concurrently, the second policy is trained for exploitation with conservative value estimates based on established value estimation techniques. We empirically verify that our proposed algorithm, combined with TD3 or SAC, significantly outperforms existing approaches across various benchmark tasks.
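Below is a minimal PyTorch sketch of the dual-policy idea described in the abstract, not the authors' implementation: an "explorer" actor maximises a UCB over a critic ensemble (using mean plus a weighted standard deviation as a stand-in for the approximate Bayesian bound), while an "exploiter" actor maximises a conservative (minimum) critic estimate, in the spirit of TD3-style clipped double-Q learning. The network sizes, ensemble size, and the bonus weight `beta` are illustrative assumptions; critic training and target networks are omitted.

```python
# Hedged sketch of a dual-policy actor update: optimistic (UCB) explorer vs.
# conservative exploiter. Ensemble size, `beta`, and architectures are assumptions.
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class DualPolicyAgent:
    def __init__(self, obs_dim: int, act_dim: int, n_critics: int = 4, beta: float = 1.0):
        self.beta = beta
        # Two separate actors: one optimistic (exploration), one conservative (exploitation).
        self.explorer = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())
        self.exploiter = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())
        # Critic ensemble; its spread acts as a crude epistemic-uncertainty proxy.
        self.critics = nn.ModuleList([mlp(obs_dim + act_dim, 1) for _ in range(n_critics)])
        self.explorer_opt = torch.optim.Adam(self.explorer.parameters(), lr=3e-4)
        self.exploiter_opt = torch.optim.Adam(self.exploiter.parameters(), lr=3e-4)

    def _q_values(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        sa = torch.cat([obs, act], dim=-1)
        return torch.stack([q(sa) for q in self.critics], dim=0)  # (n_critics, batch, 1)

    def update_actors(self, obs: torch.Tensor) -> None:
        # Explorer: maximise the upper confidence bound, mean + beta * std.
        q = self._q_values(obs, self.explorer(obs))
        explorer_loss = -(q.mean(dim=0) + self.beta * q.std(dim=0)).mean()
        self.explorer_opt.zero_grad()
        explorer_loss.backward()
        self.explorer_opt.step()

        # Exploiter: maximise the conservative (minimum) ensemble estimate.
        q = self._q_values(obs, self.exploiter(obs))
        exploiter_loss = -q.min(dim=0).values.mean()
        self.exploiter_opt.zero_grad()
        exploiter_loss.backward()
        self.exploiter_opt.step()

    def act(self, obs: torch.Tensor, explore: bool) -> torch.Tensor:
        # Collect data with the optimistic policy; evaluate with the conservative one.
        policy = self.explorer if explore else self.exploiter
        with torch.no_grad():
            return policy(obs)
```

In this reading, only the actors are updated here; the critic ensemble would be trained separately with the base algorithm's (e.g. TD3 or SAC) Bellman targets, and the explorer is used for data collection while the exploiter is used for evaluation.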
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 17364