Abstract: Efficient reinforcement learning (RL) involves a trade-off between
``exploitative'' actions that maximise expected reward and ``explorative''
actions that visit ``novel'' states. To encourage exploration, existing work
has proposed injecting stochasticity into action selection, implicit
regularisation, and synthetic heuristic rewards. However, these techniques do
not necessarily offer a systematic approach to making this trade-off. Here we
introduce
\textbf{SE}lective \textbf{R}einforcement \textbf{E}xploration
\textbf{N}etwork (SEREN), a plug-and-play framework that casts the
exploration-exploitation trade-off as a Markov game between an RL agent,
exploiter, which purely exploits task-dependent rewards, and another RL agent,
switcher, which chooses at which states to activate a \textit{pure
exploration} policy that is trained to minimise system uncertainty and to
override exploiter. Using a form of policy known as \textit{impulse control},
switcher~determines the best set of states at which to switch to the
exploration policy, while exploiter~is free to execute its actions everywhere
else. We prove the convergence of SEREN in the linear setting and show that it induces a natural
schedule towards pure exploitation. Through extensive empirical studies in
both discrete and continuous control benchmarks, we show that with minimal
modification, SEREN can be readily combined with existing RL algorithms and
yields performance improvements.
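
The switching mechanism summarised above can be illustrated with a minimal sketch. This is not the authors' implementation: the agent interfaces (ExploiterAgent-style objects with act/update methods), the intrinsic uncertainty-reduction reward, the switching cost, and the classic step() interface returning (state, reward, done, info) are all placeholder assumptions made only for illustration. At each state, switcher decides whether the executed action comes from the pure-exploration policy or from exploiter.

# Hypothetical sketch of SEREN's switching loop (illustrative only).
# exploiter, explorer, switcher and env are placeholder interfaces, not the
# authors' code.
def seren_episode(env, exploiter, explorer, switcher):
    state = env.reset()
    done = False
    while not done:
        # Impulse-control decision: should the exploration policy take over here?
        use_explorer = switcher.act(state)  # binary switching decision
        action = explorer.act(state) if use_explorer else exploiter.act(state)

        next_state, reward, done, info = env.step(action)

        # Exploiter learns purely from the task-dependent (extrinsic) reward.
        exploiter.update(state, action, reward, next_state, done)

        # Explorer learns from an uncertainty-reduction (intrinsic) signal,
        # e.g. a prediction-error bonus; the exact form is a modelling choice.
        intrinsic = explorer.intrinsic_reward(state, action, next_state)
        explorer.update(state, action, intrinsic, next_state, done)

        # Switcher is rewarded for switching only where exploration pays off,
        # e.g. intrinsic gain minus a switching cost (placeholder shaping).
        switcher.update(state, use_explorer,
                        intrinsic - switcher.switch_cost, next_state, done)

        state = next_state

Under this kind of shaping, switching becomes less attractive as the intrinsic signal shrinks, which is consistent with the natural schedule towards pure exploitation described in the abstract.
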
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Vikas_Sindhwani1
Submission Number: 2283