Systematic Exploration and Exploitation via a Markov Game with Impulse Control

TMLR Paper 2283 Authors

22 Feb 2024 (modified: 14 Mar 2024) · Under review for TMLR
Abstract: Efficient reinforcement learning (RL) involves a trade-off between ``exploitative'' actions that maximise expected reward and ``explorative'' actions that lead to the visitation of ``novel'' states. To encourage exploration, existing approaches propose techniques such as injecting stochasticity into action selection, implicit regularisation, and synthetic heuristic rewards. However, these techniques do not necessarily offer a systematic approach for making this trade-off. Here we introduce the \textbf{SE}lective \textbf{R}einforcement \textbf{E}xploration \textbf{N}etwork (SEREN), a plug-and-play framework that casts the exploration-exploitation trade-off as a Markov game between an RL agent -- exploiter, which purely exploits task-dependent rewards, and another RL agent -- switcher, which chooses at which states to activate a \textit{pure exploration} policy that is trained to minimise system uncertainty and override exploiter. Using a form of policies known as \textit{impulse control}, switcher~is able to determine the best set of states at which to switch to the exploration policy, while exploiter~is free to execute its actions everywhere else. We prove the convergence of SEREN in the linear regime, and show that it induces a natural schedule towards pure exploitation. Through extensive empirical studies on both discrete and continuous control benchmarks, we show that, with minimal modification, SEREN can be readily combined with existing RL algorithms and yields performance improvements.
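As a rough illustration of the plug-and-play structure described in the abstract, the Python sketch below shows how a switcher policy could override an exploiter with a pure-exploration policy on a per-state basis. The class and policy names (`SerenStyleWrapper`, `exploiter`, `explorer`, `switcher`) are hypothetical stand-ins, not the paper's implementation.

```python
# Illustrative sketch only; names and interfaces are assumptions, not the
# authors' code. Each policy is modelled as a callable taking a state.

class SerenStyleWrapper:
    """Dispatch actions between an exploiter and a pure-exploration policy,
    following the switcher's per-state (impulse-control style) decision."""

    def __init__(self, exploiter, explorer, switcher):
        self.exploiter = exploiter   # maximises task-dependent reward
        self.explorer = explorer     # pure exploration policy
        self.switcher = switcher     # decides where to override the exploiter

    def act(self, state):
        # The switcher chooses the set of states at which the exploration
        # policy takes over; everywhere else the exploiter acts freely.
        if self.switcher(state):
            return self.explorer(state)
        return self.exploiter(state)


if __name__ == "__main__":
    # Toy usage with stand-in policies on a 1-D state.
    exploiter = lambda s: 0               # greedy action
    explorer = lambda s: 1                # exploratory action
    switcher = lambda s: abs(s) > 2.0     # explore only in "novel" states

    agent = SerenStyleWrapper(exploiter, explorer, switcher)
    print(agent.act(0.5))   # -> 0 (exploit)
    print(agent.act(3.0))   # -> 1 (explore)
```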
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vikas_Sindhwani1
Submission Number: 2283