- Keywords: reinforcement learning, kinodynamic planning, policy search
- TL;DR: We employ a sample-based planning method for more directed exploration and efficiency in policy learning
- Abstract: Most Deep Reinforcement Learning methods perform local search and therefore are prone to get stuck on non-optimal solutions. Furthermore, in simulation based training, such as domain-randomized simulation training, the availability of a simulation model is not exploited, which potentially decreases efficiency. To overcome issues of local search and exploit access to simulation models, we propose the use of kino-dynamic planning methods as part of a model-based reinforcement learning method and to learn in an off-policy fashion from solved planning instances. We show that, even on a simple toy domain, D-RL methods (DDPG, PPO, SAC) are not immune to local optima and require additional exploration mechanisms. We show that our planning method exhibits a better state space coverage, collects data that allows for better policies than D-RL methods without additional exploration mechanisms and that starting from the planner data and performing additional training results in as good as or better policies than vanilla D-RL methods, while also creating data that is more fit for re-use in modified tasks.