Enhancing Exploration via Off-Reward Dynamic Reference Reinforcement Learning

Published: 01 Aug 2024, Last Modified: 09 Oct 2024
Venue: EWRL17
License: CC BY 4.0
Keywords: Off-Reward Dynamic Reference Policy (ORDRP), Dynamic Reference, KL Regularization, graph Laplacian, Maximum Occupancy Principle, Actor Critic
TL;DR: This paper introduces a novel reinforcement learning approach that enhances exploration and performance by training a dynamic off-reward reference policy alongside the main policy, showing superior results in challenging environments.
Abstract:

In reinforcement learning (RL), balancing exploration and exploitation is essential for maximizing a target reward function. Traditional methods often employ regularizers to prevent the policy from becoming deterministic too early, such as penalizing deviations from a static reference policy. This paper introduces a novel approach that jointly trains an off-reward dynamic reference policy (ORDRP) with the target policy, using a distinct reward function to guide exploration. We employ the Kullback–Leibler divergence between the target policy and the dynamic reference policy as a regularization mechanism. Crucially, we provide a formal proof of convergence for the ORDRP iteration method, establishing its theoretical soundness. Our approach is validated within an actor-critic framework, with the ORDRP trained using either the maximum occupancy principle or Laplacian intrinsic off-rewards. Experimental results in challenging environments demonstrate that incorporating a jointly trained ORDRP enhances exploration, resulting in superior performance and higher sample efficiency compared to state-of-the-art baselines. These findings highlight the benefits of learning the reference policy alongside the main policy, leading to improved learning outcomes. Project page: https://yamenhabib.com/ORDRP/
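
As a minimal illustration of the objective the abstract describes (the exact formulation is not given here; the temperature $\alpha$ and the discounted form below are assumptions for exposition), the target policy $\pi$ can be read as maximizing a KL-regularized return of the form

$$J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}\Big(r(s_t, a_t) \;-\; \alpha\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s_t)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s_t)\big)\Big)\right],$$

where the reference policy $\pi_{\mathrm{ref}}$ (the ORDRP) is not held fixed but is itself trained in parallel on a separate off-reward signal, such as a maximum-occupancy or Laplacian intrinsic reward.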

Supplementary Material: zip
Submission Number: 45