Time to Truncate Trajectory: Stochastic Retrace for Multi-step Off-policy Reinforcement Learning
Keywords: multi-step off-policy reinforcement learning
Abstract: While off-policy reinforcement learning methods based on one-step temporal difference learning have shown promise for solving complex decision-making problems, multi-step lookahead from behavior policies remains challenging due to the discrepancy between the behavior policy and the target policy.
Several recent works have addressed this challenge, either by introducing coefficients that correct for the discrepancy, as in Retrace, or by evolving the behavior policy in a manner similar to conservative policy iteration, as in Peng's $Q(\lambda)$.
However, neither approach works well universally, owing to the policy evaluation error caused by value estimates from the later part of a long trajectory.
In this work, we propose a stochastic truncation method that replaces the correction coefficients of Retrace with a sequence of Bernoulli random variables, removing the later part of the trajectory, which degrades off-policy evaluation by adding unnecessary noise.
Unlike prior methods for reducing the off-policy discrepancy,
our stochastic truncation combines the strengths of both conservative and non-conservative multi-step RL methods.
We demonstrate that our algorithm, Time-to-Truncate-Trajectory (T4), outperforms various model-free RL methods.
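The abstract only sketches the idea, so the following is a minimal, hypothetical reading of it, not the authors' implementation: starting from the standard Retrace target $Q(s_0,a_0) + \sum_t \gamma^t \big(\prod_{s=1}^{t} c_s\big)\,\delta_t$ with $c_s = \lambda \min(1, \rho_s)$, each coefficient $c_s$ is replaced by a Bernoulli draw $z_s \sim \mathrm{Bernoulli}(c_s)$; since $\mathbb{E}[z_s] = c_s$, the estimate is preserved in expectation, and the first $z_s = 0$ truncates the remainder of the trajectory. All names and parameter choices below are assumptions.

```python
import numpy as np

def stochastic_retrace_target(rewards, q_values, rhos,
                              gamma=0.99, lam=0.95, rng=None):
    """Hypothetical stochastic-truncation variant of Retrace.

    rewards[t]  : reward at step t (length T)
    q_values[t] : Q(s_t, a_t), with a bootstrap value at the end (length T+1)
    rhos[t]     : importance ratios pi(a_t|s_t) / mu(a_t|s_t) (length T)
    """
    rng = rng or np.random.default_rng()
    target = q_values[0]
    for t in range(len(rewards)):
        if t > 0:
            # Retrace coefficient c_t = lam * min(1, rho_t), replaced by a
            # Bernoulli(c_t) draw; z_t = 0 discards the rest of the trajectory.
            c_t = lam * min(1.0, rhos[t])
            if rng.random() >= c_t:
                break
        # One-step TD error, accumulated with weight gamma^t once the
        # step has survived every Bernoulli draw so far.
        delta = rewards[t] + gamma * q_values[t + 1] - q_values[t]
        target += (gamma ** t) * delta
    return target
```

When every draw succeeds, the loop reproduces the full multi-step return; a failed draw leaves only the prefix, which is how the later, noisier part of the trajectory gets dropped.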
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Resubmission: No
Student Author: Yes
Large Language Models: Yes, at the sentence level (e.g., fixing grammar, re-wording sentences)
Submission Number: 9158