Keywords: Offline Reinforcement Learning, Transformer
TL;DR: Integrating supervised learning and TD learning in a Transformer to better align actions with target returns in offline RL.
Abstract: Recently, there has been increasing interest in applying Transformers to offline reinforcement learning (RL). Existing methods typically frame offline RL as a sequence modeling problem and learn actions via supervised learning (RvS). However, RvS-trained Transformers struggle to align actual returns with desired target returns, especially for returns underrepresented in the dataset (interpolation) or higher returns that are absent from the dataset but achievable by stitching sub-optimal trajectories (extrapolation). In this work, we propose a novel method that Double Checks the Transformer with value validation for Offline RL (Doctor). Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing the action prediction and the value function. SL stabilizes the prediction of actions conditioned on target returns, while TD learning adds stitching capability to the Transformer. During inference, we introduce a double-check mechanism: we sample actions around desired target returns and validate them with the learned value function. This mechanism ensures better alignment between predicted actions and desired target returns and is beneficial for further online exploration and fine-tuning. We evaluate Doctor on the D4RL benchmark in both offline and offline-to-online settings, demonstrating that Doctor achieves substantially better return alignment, both within and beyond the dataset. Furthermore, Doctor performs on par with or outperforms existing RvS-based and TD-based offline RL methods in final performance.
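The abstract's double-check mechanism can be illustrated with a minimal sketch: sample candidate actions conditioned on slightly perturbed target returns, score each with the TD-trained value function, and keep the action whose predicted value best matches the target. The callables `policy(state, rtg)` and `q_value(state, action)` and all hyperparameters below are hypothetical placeholders, not the authors' actual API.

```python
# Hedged sketch of the double-check inference step; `policy` and `q_value`
# stand in for the SL-trained Transformer head and TD-trained value estimate.
import numpy as np

def double_check_action(policy, q_value, state, target_return,
                        num_samples=16, noise_scale=0.05, rng=None):
    """Sample actions around the desired target return, then keep the action
    whose predicted value best matches that target (value validation)."""
    rng = np.random.default_rng() if rng is None else rng
    best_action, best_gap = None, np.inf
    for _ in range(num_samples):
        # Perturb the conditioning return slightly ("sample around the target").
        rtg = target_return * (1.0 + noise_scale * rng.standard_normal())
        action = policy(state, rtg)       # action predicted by the Transformer
        value = q_value(state, action)    # value estimate for that action
        gap = abs(value - target_return)  # validation: value vs. desired return
        if gap < best_gap:
            best_action, best_gap = action, gap
    return best_action
```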
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13368