Keywords: Offline reinforcement learning, policy gradient, long horizon
TL;DR: This paper introduces a policy gradient method that plans action sequences in high-dimensional spaces and replaces the maximum-likelihood objective with a cross-entropy loss, significantly improving stability and performance in long-horizon tasks.
Abstract: Offline reinforcement learning methods, which typically train agents that make decisions step by step, are known to suffer from instability due to bootstrapping and function approximation, especially in tasks that require long-horizon planning. To alleviate these issues, we propose a novel policy gradient approach that plans an action sequence in a high-dimensional space. This design implicitly models temporal dependencies, making it particularly effective in long-horizon and horizon-critical tasks. Furthermore, we find that replacing the maximum-likelihood objective with a cross-entropy loss in policy gradient methods significantly stabilizes training gradients, yielding substantial performance improvements on long-horizon tasks. The proposed neural-network-based solution has a simple architecture that is easy to train and converges reliably while remaining efficient and effective. Extensive experiments show that our method performs strongly across a variety of tasks.
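To make the loss substitution concrete, below is a minimal, hypothetical PyTorch sketch of one plausible reading of the abstract: actions are discretized into bins, so a return-weighted categorical cross-entropy over action bins replaces a Gaussian log-likelihood in the policy gradient objective, and the policy emits an entire action sequence in one shot. All names (SequencePolicy, cross_entropy_pg_loss, num_bins) and shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch, not the authors' code: a policy that outputs logits
# over discretized action bins for a whole horizon at once, trained with a
# return-weighted cross-entropy loss instead of a Gaussian log-likelihood.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequencePolicy(nn.Module):
    """Maps a state to logits over action bins for every step of the horizon."""

    def __init__(self, state_dim, horizon, action_dim, num_bins=64, hidden=256):
        super().__init__()
        self.horizon, self.action_dim, self.num_bins = horizon, action_dim, num_bins
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim * num_bins),
        )

    def forward(self, state):
        # Logits of shape (batch, horizon, action_dim, num_bins).
        return self.net(state).view(
            -1, self.horizon, self.action_dim, self.num_bins
        )


def cross_entropy_pg_loss(policy, states, action_bins, returns):
    """Return-weighted cross-entropy over discretized action bins.

    states:      (batch, state_dim) float tensor.
    action_bins: (batch, horizon, action_dim) long tensor of bin indices
                 for the dataset actions.
    returns:     (batch,) scalar returns used as policy-gradient weights.
    """
    logits = policy(states)                                    # (B, H, A, K)
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each observed action bin, summed over the sequence.
    picked = log_probs.gather(-1, action_bins.unsqueeze(-1)).squeeze(-1)
    seq_log_prob = picked.sum(dim=(1, 2))                      # (B,)
    return -(returns * seq_log_prob).mean()


# Example usage with random data (shapes only, for illustration):
policy = SequencePolicy(state_dim=17, horizon=32, action_dim=6)
states = torch.randn(8, 17)
bins = torch.randint(0, 64, (8, 32, 6))
rets = torch.randn(8)
loss = cross_entropy_pg_loss(policy, states, bins, rets)
loss.backward()
```

One possible intuition for the claimed stabilization, under this reading: the gradient of a categorical cross-entropy with respect to the logits is bounded (softmax probability minus one-hot target), whereas a Gaussian log-likelihood's gradient scales with the action error and inverse variance. The paper's actual formulation may differ.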
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12879