Generalized Maximum Entropy Reinforcement Learning via Reward Shaping

Anonymous

Sep 29, 2021 (edited Oct 01, 2021) · ICLR 2022 Conference Blind Submission
  • Keywords: Reinforcement Learning, Reward Shaping, Soft Policy Gradient
  • Abstract: Entropy regularization is a commonly used technique in reinforcement learning to improve exploration and to cultivate a better pre-trained policy for later adaptation. Recent studies further show that entropy regularization can smooth the optimization landscape and simplify the policy optimization process, underscoring the value of integrating entropy into reinforcement learning. However, existing studies only consider the policy's entropy at the current state as an extra regularization term in the policy gradient or in the objective function; integrating the entropy into the reward function itself has not been investigated. In this paper, we propose a shaped reward that incorporates the agent's policy entropy into the reward function. In particular, the agent's policy entropy at the next state is added to the immediate reward associated with the current state. Adding the policy entropy at the next state, instead of at the current state as in the existing maximum entropy reinforcement learning framework, accounts for both state and action uncertainties. This distinguishes our work from the existing maximum entropy reinforcement learning framework by providing better action exploration and better control policies. We also show that adding the policy entropy at the next state yields a new soft Q function and state value function that are concise and modular. Hence, the new reinforcement learning framework can be easily applied to existing standard reinforcement learning algorithms while inheriting the benefits of entropy regularization. We further present a soft stochastic policy gradient theorem based on the shaped reward and propose a new practical reinforcement learning algorithm. Finally, experimental studies conducted in the MuJoCo environment demonstrate that our method can outperform existing state-of-the-art reinforcement learning approaches.
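The core shaping idea in the abstract can be sketched in a few lines: augment the immediate reward r(s, a) with the policy's entropy evaluated at the *next* state s'. The snippet below is a minimal illustration for a discrete action space; the function names, the temperature `alpha`, and its value are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

def policy_entropy(action_probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    p = np.asarray(action_probs, dtype=float)
    p = p[p > 0]  # drop zero-probability actions (0 * log 0 = 0 by convention)
    return float(-np.sum(p * np.log(p)))

def shaped_reward(r, next_state_action_probs, alpha=0.1):
    """Shaped reward: immediate reward plus the policy's entropy at the
    NEXT state, scaled by a temperature alpha (hypothetical value)."""
    return r + alpha * policy_entropy(next_state_action_probs)

# Uniform policy over 4 actions at the next state: entropy = log(4).
r_tilde = shaped_reward(1.0, [0.25, 0.25, 0.25, 0.25], alpha=0.1)
```

In contrast, the standard maximum entropy framework would add `alpha * policy_entropy(current_state_action_probs)` instead; using the next state's entropy is what couples the bonus to the state transition, reflecting both state and action uncertainty.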
  • Supplementary Material: zip