Keywords: reinforcement learning, state planning, Lagrangian optimization, inverse dynamics model
TL;DR: We introduce an improvement for reinforcement learning algorithms in continuous settings called State Planning Policy RL (SPP-RL), in which the actor plans the next state instead of an action.
Abstract: We introduce an improvement for reinforcement learning (RL) algorithms in continuous settings called State Planning Policy RL (SPP-RL). In SPP-RL, the actor plans the next state given the current state. To communicate the actor's output to the environment, we incorporate an inverse dynamics control model and train it using supervised learning.
We evaluate our improvement on top of state-of-the-art off-policy reinforcement learning algorithms: TD3 and SAC.
The target states need to be physically relevant; the overall learning procedure is therefore formulated as a constrained optimization problem, solved via the classical method of Lagrangian multipliers. We benchmark the state planning RL approach on a set of Safety-gym level 0 environments (no safety cost involved) and the AntPush environment.
We find that SPP-RL significantly outperforms the baselines in terms of average return. We attribute the performance boost to more efficient exploration by the SPP-RL agent, which is performed in the target-state space rather than the action space. We report numerical experiments confirming this finding.
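A minimal sketch of the mechanism described in the abstract: the actor proposes a target next state, a supervised inverse dynamics model converts the (current state, target state) pair into an environment action, and the actor objective carries a Lagrangian penalty standing in for the physical-relevance constraint. All names (`StatePlanningActor`, `InverseDynamicsModel`, `actor_loss_with_state_constraint`), architectures, and the concrete form of the constraint are illustrative assumptions, not the authors' implementation (see the supplementary material for the actual code).

```python
# Illustrative sketch only; names, architectures, and the constraint form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StatePlanningActor(nn.Module):
    """Actor that outputs a target next state instead of an action."""

    def __init__(self, state_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state):
        # Planned target state for the next time step.
        return self.net(state)


class InverseDynamicsModel(nn.Module):
    """Maps (current state, target state) to the action realizing the transition."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )

    def forward(self, state, target_state):
        return self.net(torch.cat([state, target_state], dim=-1))


def inverse_dynamics_loss(idm, states, actions, next_states):
    """Supervised regression of the inverse dynamics model on observed (s, a, s') data."""
    predicted_actions = idm(states, next_states)
    return F.mse_loss(predicted_actions, actions)


def actor_loss_with_state_constraint(critic, actor, idm, states,
                                     lagrange_multiplier, epsilon=0.05):
    """Actor objective with a Lagrangian penalty on the planned target states.

    The squared distance between the planned target state and the current state is
    only a placeholder for the paper's physical-relevance (reachability) constraint.
    """
    target_states = actor(states)
    actions = idm(states, target_states)
    q_values = critic(states, actions)
    constraint = (target_states - states).pow(2).sum(dim=-1).mean() - epsilon
    return -q_values.mean() + lagrange_multiplier * constraint
```

In this reading, the underlying TD3/SAC machinery is unchanged; only the actor's output space (target states rather than actions) and the extra inverse dynamics model and Lagrangian constraint are new.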
Supplementary Material: zip