Keywords: Transformer, safe reinforcement learning
TL;DR: We propose the Constrained Q-learning Decision Transformer to address the problem of safe offline reinforcement learning.
Abstract: In safe offline reinforcement learning (RL), the objective is to use offline data to train a policy that maximizes long-term rewards while adhering to safety constraints. Recent work, such as the Constrained Decision Transformer (CDT), has leveraged the Transformer architecture to build a safe RL agent capable of dynamically adjusting the balance between safety and task rewards. However, like other Transformer-based RL agents such as the Decision Transformer (DT), CDT often lacks the stitching ability needed to produce policies better than those present in the offline dataset. We introduce the Constrained Q-learning Decision Transformer (CQDT) to address this issue. At the core of our approach is a novel trajectory relabeling scheme based on learned value functions, with careful consideration of the trade-off between safety and cumulative rewards. Experimental results show that our proposed algorithm outperforms several baselines across a variety of safe offline RL benchmarks.
Submission Number: 71