Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

03 Nov 2022 (modified: 05 May 2023), MLmDS 2023
Keywords: reinforcement learning, policy optimization, explore/exploit
TL;DR: This work proposes a new reinforcement learning objective that explicitly addresses the exploration/exploitation trade-off in dynamic environments.
Abstract: We develop a new measure of the exploration/exploitation trade-off in infinite-horizon reinforcement learning (RL) problems called the occupancy information ratio (OIR), defined as the ratio of the infinite-horizon average cost of a policy to the entropy of its induced long-term state occupancy measure. Modifying the classic RL objective in this way yields policies that strike an optimal balance between exploitation and exploration, providing a new tool for addressing the exploration/exploitation trade-off in RL. The paper develops, for the first time, policy gradient and actor-critic algorithms for OIR optimization based on a new entropy gradient theorem, and establishes both asymptotic and non-asymptotic convergence results with global optimality guarantees. In experiments, these methods outperform several deep RL baselines in problems with sparse rewards, where many trajectories may be uninformative and skepticism about the environment is crucial to success.
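In symbols, the OIR the abstract describes can be sketched as follows; this is a schematic restatement only, and the exact normalization (e.g., any regularizing constant in the denominator) follows the paper rather than this note:

\[
\rho(\pi) \;=\; \frac{J(\pi)}{\mathcal{H}(d_\pi)}, \qquad
J(\pi) \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} c(s_t, a_t)\right], \qquad
\mathcal{H}(d_\pi) \;=\; -\sum_{s} d_\pi(s) \log d_\pi(s),
\]

where \(c\) is the per-step cost and \(d_\pi\) is the long-term state occupancy measure induced by policy \(\pi\). Minimizing \(\rho(\pi)\) thus trades off low average cost (exploitation) against high occupancy entropy, i.e., broad state coverage (exploration).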