$\beta$-DQN: Diverse Exploration via Learning a Behavior Function

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Deep Reinforcement Learning, Exploration
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Diverse exploration while maintaining simplicity, generality and computational efficiency.
Abstract: Efficient exploration remains a pivotal challenge in reinforcement learning (RL). While numerous methods have been proposed, their lack of simplicity, generality, and computational efficiency often leads researchers to fall back on simple techniques such as $\epsilon$-greedy. Motivated by these considerations, we propose $\beta$-DQN. This method improves exploration by constructing a set of diverse policies through a behavior function $\beta$ learned from the replay memory. First, $\beta$ differentiates actions based on their frequency at each state, which can be used to design strategies for better state coverage. Second, we constrain temporal difference (TD) learning to in-sample data and derive two functions, $Q$ and $Q_{\textit{mask}}$. Function $Q$ may overestimate unseen actions, providing a foundation for bias-correction exploration. $Q_{\textit{mask}}$ reduces the values of unseen actions in $Q$ by using $\beta$ as an action mask, yielding a greedy policy that purely exploits in-sample data. We combine $\beta$, $Q$, and $Q_{\textit{mask}}$ to construct a set of policies ranging from exploration to exploitation, and an adaptive meta-controller then selects an effective policy for each episode. $\beta$-DQN is straightforward to implement, imposes minimal hyper-parameter tuning demands, and adds only a modest computational overhead to DQN. Our experiments, conducted on simple and challenging exploration domains, demonstrate that $\beta$-DQN significantly enhances performance and exhibits broad applicability across a wide range of tasks.
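For intuition, below is a minimal tabular sketch of how $\beta$, $Q$, and $Q_{\textit{mask}}$ might be combined into a policy set spanning exploration to exploitation. The linear interpolation scheme, the $\beta$ threshold of 0.3, and the bandit-style meta-controller are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Hypothetical tabular sketch of the policy-set construction described in the
# abstract. Network details, the masking threshold, and the meta-controller
# update rule are illustrative assumptions, not the paper's specification.

def build_policy_set(Q, Q_mask, num_policies=5):
    """Return greedy action-selection functions ranging from exploration
    (trust Q, which may overestimate unseen actions) to exploitation
    (trust Q_mask, restricted to in-sample actions)."""
    policies = []
    for k in range(num_policies):
        w = k / (num_policies - 1)          # 0 -> explore, 1 -> exploit
        mixed = (1 - w) * Q + w * Q_mask    # simple interpolation (assumption)
        policies.append(lambda s, m=mixed: int(np.argmax(m[s])))
    return policies

# Toy example: 3 states, 2 actions.
Q    = np.array([[1.0, 2.0], [0.5, 0.1], [0.0, 1.5]])
beta = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # action frequencies per state
# Mask out actions rarely seen in the replay memory (threshold is an assumption).
Q_mask = np.where(beta > 0.3, Q, -np.inf)

policies = build_policy_set(Q, Q_mask)

# A bandit-style meta-controller could track per-policy episodic returns and
# sample the next episode's policy in proportion to recent performance.
returns = np.ones(len(policies))
probs = returns / returns.sum()
chosen = np.random.choice(len(policies), p=probs)
print("policy", chosen, "acts in state 0 ->", policies[chosen](0))
```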
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7079