Improving Exploration in Deep Reinforcement Learning by State Planning Policies

Anonymous

08 Oct 2022 (modified: 05 May 2023), submitted to Deep RL Workshop 2022
Keywords: reinforcement learning, state planning, lagrangian optimization, inverse dynamics model
TL;DR: We introduce state planning policy RL (SPP-RL), an improvement for reinforcement learning algorithms in the continuous setting.
Abstract: We introduce state planning policy RL (SPP-RL), an improvement for reinforcement learning (RL) algorithms in the continuous setting. In SPP-RL, the actor plans the next state given the current state. To communicate the actor output to the environment, we incorporate an inverse dynamics control model trained with supervised learning. We evaluate our improvement on top of state-of-the-art off-policy reinforcement learning algorithms, TD3 and SAC. Because the target states must be physically relevant, the overall learning procedure is formulated as a constrained optimization problem and solved via the classical method of Lagrangian multipliers. We benchmark the state planning RL approach on a set of Safety-Gym level 0 environments (no safety cost involved) and the AntPush environment. We find that SPP-RL significantly outperforms the baselines in terms of average return. We attribute the performance boost to more efficient exploration by the SPP-RL agent, performed in the target-state space rather than the action space, and we report numerical experiments confirming this finding.
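
The mechanism summarized in the abstract can be illustrated with a short sketch. The PyTorch code below is an assumption-laden illustration, not the authors' implementation: all class and function names are hypothetical. It shows the two components the abstract refers to: an actor that outputs a target state, and an inverse dynamics model that maps the (current state, target state) pair to an environment action and is trained with supervised learning on observed transitions; it also shows the kind of Lagrangian-relaxed actor objective one would expect from the constrained formulation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the state-planning loop (hypothetical names, not from the paper):
# the policy proposes a target state and a learned inverse dynamics model (IDM)
# converts it into an environment action.

class InverseDynamicsModel(nn.Module):
    """Predicts the action that moves the agent from state s to target state s'."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor, target_state: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, target_state], dim=-1))


def act(policy: nn.Module, idm: InverseDynamicsModel, state: torch.Tensor) -> torch.Tensor:
    """The actor plans the next state; the IDM translates it into an environment action."""
    target_state = policy(state)        # actor output lives in the state space
    action = idm(state, target_state)   # mapped to the action space for the environment
    return action


def idm_supervised_loss(idm: InverseDynamicsModel, batch) -> torch.Tensor:
    """Supervised IDM training on observed (s, a, s') transitions, e.g. from a replay buffer."""
    state, action, next_state = batch
    predicted_action = idm(state, next_state)
    return nn.functional.mse_loss(predicted_action, action)


def actor_loss_with_constraint(q_value: torch.Tensor,
                               constraint_violation: torch.Tensor,
                               lam: torch.Tensor) -> torch.Tensor:
    """Lagrangian-relaxed actor objective: maximize the critic value while
    penalizing target states that violate the physical-relevance constraint
    (constraint_violation >= 0). The multiplier lam is updated by dual ascent."""
    return (-q_value + lam.detach() * constraint_violation).mean()
```

In this sketch the actor and critics would be trained as in TD3 or SAC, with the IDM loss and the multiplier update added as separate optimization steps; the exact constraint and update schedule are described in the paper rather than here.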