D2T2: Decision Transformer with Temporal Difference via Steering Guidance

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Offline Reinforcement Learning, Reinforcement Learning via Supervised Learning, Decision Transformer
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Despite the promising performance of Decision Transformers (DT) on a wide range of tasks, recent studies have found that the performance of DT can depend strongly on the specific characteristics of the task at hand, most importantly the stochasticity of the environment. We first focus on this issue and prove that a well-trained DT can recover the optimal trajectory almost surely in an environment with random initial states but deterministic transitions and rewards, explaining the remarkable performance of DT in deterministic tasks. Notably, it follows from our analysis that for stochastic transitions and rewards, the performance of DT may degrade significantly due to the growing variance of the returns-to-go (RTG) accumulated over the horizon. To address this, we extend DT to Decision Transformer with Temporal Difference via Next-State Guidance (D2T2), which mitigates the growing variance of RTGs and leads to significantly improved performance in stochastic tasks. D2T2 maps the current state to a guiding vector that steers DT toward high-reward regions, with the expected returns approximated by temporal-difference learning. This approach also addresses another severe challenge faced by DT: its requirement of RTGs as input at evaluation/deployment time. Experimental results on various stochastic tasks and D4RL environments are provided to establish the superior performance of our proposed method compared to state-of-the-art (SOTA) offline reinforcement learning methods.
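To make the abstract's description concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients it names: a value network trained with a temporal-difference target, and a state-conditioned guidance network whose output replaces the RTG token when conditioning the transformer. All module names, architectures, dimensions, and the transformer interface are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of RTG-free conditioning: a guiding vector computed from
# the current state replaces the returns-to-go token, and the value estimate
# behind it is fit with a TD(0) target. Names and shapes are assumptions.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """State-value estimate V(s), regressed toward r + gamma * V(s')."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

class GuidanceNet(nn.Module):
    """Maps the current state to a guiding vector used in place of the RTG input."""
    def __init__(self, state_dim, guide_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, guide_dim),
        )

    def forward(self, s):
        return self.net(s)

def td_value_loss(value_net, batch, gamma=0.99):
    """TD(0) regression loss on an offline batch of (state, reward, next_state, done)."""
    with torch.no_grad():
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * value_net(batch["next_state"])
    return nn.functional.mse_loss(value_net(batch["state"]), target)

# At decision time, no target return is requested from the user: the guiding
# vector is computed from the observed state and passed to the transformer as
# the conditioning token (the transformer is assumed to accept such a token).
#   guide  = guidance_net(state)                        # (batch, guide_dim)
#   action = decision_transformer(states, actions, guide)
```

The TD target avoids Monte-Carlo returns, which is one plausible way to sidestep the RTG variance that the abstract argues grows with the horizon under stochastic transitions and rewards.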
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2134