State Planning Policies Online Reinforcement Learning

Published: 01 Jan 2024 · Last Modified: 16 Apr 2025 · CDC 2024 · CC BY-SA 4.0
Abstract: We introduce State Planning Policy Reinforcement Learning (SPP-RL), an online RL approach in which the actor plans the next state given the current state. To translate the actor's output into an action the environment can execute, we incorporate an inverse dynamics control model trained with supervised learning. SPP-RL introduces a novel way of ensuring reachability of planned target states through constrained optimization and the Lagrange multiplier method. We demonstrate the versatility of SPP-RL by implementing variants of three standard RL algorithms: DDPG, TD3, and SAC. We conduct a thorough evaluation across seven benchmarks, including Safety-Gym Level 0, AntPush, and MuJoCo. Our results consistently show that SPP algorithms outperform their vanilla counterparts in average return in a systematic and significant manner. Moreover, we present a convergence proof for SPP-RL in a finite setting. Finally, we share our source code and trained models.
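The abstract's control loop (actor plans a target state, an inverse dynamics model converts the plan into an action, and a Lagrange multiplier enforces reachability) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear maps standing in for the actor and inverse dynamics networks, the dimensions, the tolerance `eps`, and the dual step size `lr_dual` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

# Actor: plans the next state from the current state
# (stand-in linear map; the paper uses a trained policy network).
W_actor = rng.normal(size=(STATE_DIM, STATE_DIM)) * 0.1

def actor(state):
    """Propose a target next state as a small planned displacement."""
    return state + W_actor @ state

# Inverse dynamics model: in SPP-RL this is trained with supervised
# learning on observed (s_t, s_{t+1}) -> a_t transitions.
W_inv = rng.normal(size=(ACTION_DIM, 2 * STATE_DIM)) * 0.1

def inverse_dynamics(state, target_state):
    """Map a (state, planned next state) pair to an executable action."""
    return W_inv @ np.concatenate([state, target_state])

# One step of the SPP control loop.
s = rng.normal(size=STATE_DIM)
s_target = actor(s)                # actor plans the next state
a = inverse_dynamics(s, s_target)  # translate the plan into an action

# Reachability constraint via dual ascent on a Lagrange multiplier:
# penalize plans whose distance from the achieved next state exceeds
# a tolerance eps; lambda is updated by projected gradient ascent.
eps, lam, lr_dual = 0.1, 0.0, 0.05
achieved = s + 0.01 * a.sum()      # hypothetical environment transition
violation = np.linalg.norm(s_target - achieved) - eps
lam = max(0.0, lam + lr_dual * violation)
```

In the actual method, `lam` would weight a reachability penalty in the actor's loss, so the policy learns to propose target states the inverse dynamics model can actually reach.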