Guided Reinforcement Learning with Roll-Back

Lauren Y. Taylor; Wei Emma Zhang; Claudia Szabo

26 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: reinforcement learning, guide policy, warm-start

TL;DR: Prevent performance degradation when using a guide policy to warm-start online reinforcement learning.

Abstract: Reinforcement learning-based solutions are increasingly being considered as strong alternatives to classical system controllers, despite their significant sample inefficiency when learning controller tasks from scratch. Many methods that address this issue use prior task knowledge to guide the agent's learning, with several recent algorithms providing a guide policy that is sometimes chosen to execute actions instead of the learner policy. While this approach lends excellent flexibility as it allows the guide knowledge to be provided in any format, it can be challenging to decide when and for how long to use the guide agent. Current guide policy-based approaches typically choose a static guide sampling rate empirically, and do not vary it. Approaches that transfer control use simple methods like linear decay, or require hyperparameter choices that strongly impact the performance. We show that under certain assumptions, the sampling rate of the guide policy can be calculated to guarantee that the mean return of the learning policy will surpass a user-defined performance degradation threshold. To the best of our knowledge, this is the first time a performance guarantee has been established for a guided RL method. We then implement a guided RL (GRL) algorithm that can make use of this sample rate, and additionally introduce a roll-back feature in guided RL with roll-back (GRL-RB) to adaptively balance the trade-off between performance degradation and rapid transfer of control to the learner. Our approach is simple to implement on top of existing algorithms, robust to hyperparameter choices, and effective in warm-starting online learning.
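The guide-sampling and roll-back ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' GRL or GRL-RB implementation: the function names (`select_action`, `update_guide_rate`), the fixed `step` size, and the simple mean-return comparison are all assumptions chosen for illustration. The abstract states that the guide sampling rate can be calculated under certain assumptions, but it does not give the formula, so the sketch takes the rate as an input and only shows how control might shift toward the learner and roll back toward the guide when performance drops below a user-defined threshold.

```python
import random
from typing import Callable, Sequence

# Hypothetical type aliases for illustration only.
State = int
Action = int
Policy = Callable[[State], Action]


def select_action(
    state: State,
    guide: Policy,
    learner: Policy,
    guide_rate: float,
    rng: random.Random,
) -> Action:
    """Act with the guide with probability guide_rate, else with the learner."""
    if rng.random() < guide_rate:
        return guide(state)
    return learner(state)


def update_guide_rate(
    guide_rate: float,
    recent_returns: Sequence[float],
    threshold: float,
    step: float = 0.1,
) -> float:
    """Illustrative roll-back rule (an assumption, not the paper's rule).

    If the mean of recent returns meets the threshold, hand more control
    to the learner by lowering the guide rate. Otherwise roll back by
    raising the guide rate again.
    """
    mean_return = sum(recent_returns) / len(recent_returns)
    if mean_return >= threshold:
        return max(0.0, guide_rate - step)
    return min(1.0, guide_rate + step)


if __name__ == "__main__":
    rng = random.Random(0)
    guide = lambda s: 0    # stand-in guide policy
    learner = lambda s: 1  # stand-in learner policy
    rate = 0.5
    action = select_action(0, guide, learner, rate, rng)
    rate = update_guide_rate(rate, [0.2, 0.3], threshold=0.5)
    print(action, rate)
```

In this sketch the threshold plays the role of the user-defined performance degradation threshold mentioned in the abstract. A real implementation would estimate returns from evaluation episodes and would likely use the calculated sampling rate rather than a fixed step.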

Primary Area: reinforcement learning

Submission Number: 5884
