Double Check My Desired Return: Transformer with Value Validation for Offline RL

ICLR 2025 Conference Submission 13368 Authors

28 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Offline Reinforcement Learning, Transformer
TL;DR: Integrating supervised learning and TD learning in a Transformer to better align actions with target returns in offline RL.
Abstract: Recently, there has been increasing interest in applying Transformers to offline reinforcement learning (RL). Existing methods typically frame offline RL as a sequence modeling problem and learn actions via supervised learning (RvS). However, RvS-trained Transformers struggle to align actual returns with desired target returns, especially for returns that are underrepresented in the dataset (interpolation) or higher returns that could only be reached by stitching sub-optimal trajectories (extrapolation). In this work, we propose a novel method that Double Checks the Transformer with value validation for Offline RL (Doctor). Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing action prediction and the value function. SL stabilizes the prediction of actions conditioned on target returns, while TD learning adds stitching capability to the Transformer. During inference, we introduce a double-check mechanism: we sample actions around desired target returns and validate them with value functions. This mechanism ensures better alignment between the predicted action and the desired target return and is beneficial for further online exploration and fine-tuning. We evaluate Doctor on the D4RL benchmark in both offline and offline-to-online settings, demonstrating that Doctor achieves substantially better return alignment, both within and beyond the dataset. Furthermore, Doctor matches or outperforms existing RvS-based and TD-based offline RL methods in final performance.
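
The abstract describes the double-check mechanism only at a high level. The sketch below is one plausible reading, not the authors' released code: it assumes a return-conditioned Transformer head `policy(state, return) -> action` and a TD-trained value function `q_fn(state, action) -> value`, and the candidate count and noise scale are illustrative hyperparameters.

```python
# Hypothetical sketch of a double-check-style action selection step:
# sample candidate actions conditioned on target returns perturbed around
# the desired return, then keep the candidate whose learned value estimate
# best matches that desired return.
import torch

def double_check_action(policy, q_fn, state, target_return,
                        num_candidates=16, return_noise=0.05):
    """policy and q_fn are placeholders for a return-conditioned
    Transformer head and a TD-trained value function."""
    # Perturb the desired target return to form candidate conditioning returns.
    noise = return_noise * target_return * torch.randn(num_candidates)
    candidate_returns = target_return + noise

    # Predict one action per candidate return (batched over candidates).
    states = state.unsqueeze(0).expand(num_candidates, -1)
    actions = policy(states, candidate_returns.unsqueeze(-1))

    # Validate: score each action by how close its value estimate is to the
    # desired target return, and keep the best-aligned candidate.
    values = q_fn(states, actions).squeeze(-1)
    best = torch.argmin((values - target_return).abs())
    return actions[best]
```

Under this reading, the value function acts as a filter on the RvS policy's return-conditioned proposals, which is also why the same machinery would remain useful for later online exploration and fine-tuning.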
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13368