TL;DR: We tailor the design choices of the particle-filter planner to its specific use in reinforcement learning, drawing inspiration from the related Monte-Carlo tree search literature.
Abstract: Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL).
However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners such as sequential Monte-Carlo (SMC).
Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem.
Yet, design choices inherited from these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning.
Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation.
This leads to our *Trust-Region Twisted* SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
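For readers unfamiliar with SMC-based planning, the sketch below illustrates in plain NumPy what a particle-filter planning loop with the two data-generation tweaks named above (constrained action sampling from a learned prior and explicit terminal-state handling) might look like. This is a minimal, hypothetical sketch: the names `env_model`, `prior_policy`, and `value_fn` are placeholder assumptions and do not mirror the actual TRT-SMC implementation in the linked repository.

```python
# Illustrative SMC planning sketch (not the paper's TRT-SMC code).
import numpy as np

def smc_plan(root_state, env_model, prior_policy, value_fn,
             num_particles=64, horizon=16, rng=None):
    rng = rng or np.random.default_rng()
    states = [root_state] * num_particles
    first_actions = [None] * num_particles            # root action carried by each particle
    log_weights = np.zeros(num_particles)
    done = np.zeros(num_particles, dtype=bool)

    for t in range(horizon):
        for i in range(num_particles):
            if done[i]:
                continue                               # explicit terminal handling: freeze finished particles
            # constrained action sampling: draw from the learned prior rather than uniformly
            action = prior_policy.sample(states[i], rng)
            next_state, reward, terminal = env_model.step(states[i], action)
            if t == 0:
                first_actions[i] = action
            states[i] = next_state
            log_weights[i] += reward                   # reward shapes the particle weight
            done[i] = terminal

        # standard SMC step: resample particles in proportion to their weights
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        idx = rng.choice(num_particles, size=num_particles, p=probs)
        states = [states[i] for i in idx]
        first_actions = [first_actions[i] for i in idx]
        done = done[idx]
        log_weights = np.zeros(num_particles)

    # bootstrap non-terminal particles with the learned value before the final weighting
    final_logw = np.array([0.0 if done[i] else value_fn(states[i])
                           for i in range(num_particles)])
    probs = np.exp(final_logw - final_logw.max())
    probs /= probs.sum()
    idx = rng.choice(num_particles, size=num_particles, p=probs)
    # the empirical distribution over surviving root actions serves as the improved policy target
    return [first_actions[i] for i in idx]
```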
Lay Summary: An important part of sequential decision making is effectively planning ahead to anticipate the consequences of actions. Planning over time, however, is often expensive in terms of computation and time. One powerful approach from recent years is to learn how to steer the planner's search budget. Given more data and training time, this learning method can iterate on itself to further improve the quality of its guidance.
Our paper combines ideas from two successful approximate planning algorithms in this area to develop a new method. Our contributions ensure that 1) we exploit specialized hardware more effectively, 2) we improve the quality of the data generated by the planning algorithm, and 3) we improve how and where the planner spends its budget.
Our results show that, compared to previous approaches, our method greatly reduces the real time and data needed to train a sequential decision-making agent. This makes these algorithms more applicable and accessible by reducing their practical costs.
Link To Code: https://github.com/joeryjoery/trtpi
Primary Area: Reinforcement Learning->Planning
Keywords: Reinforcement Learning; Sequential Monte-Carlo; Monte-Carlo Tree Search; planning; model-based; policy improvement
Submission Number: 274