Keywords: offline RL, safe RL
Abstract: This paper addresses the problem of safe offline reinforcement learning, which involves training a policy to satisfy safety constraints using only an offline dataset. This problem is inherently challenging as it requires balancing three highly interconnected and competing objectives: satisfying safety constraints, maximizing rewards, and adhering to the behavior regularization imposed by the offline dataset. To tackle this three-way challenge, we propose a novel framework, the Q-learning Penalized Transformer policy (QPT). Specifically, QPT adopts a sequence modeling paradigm, learning the action distribution conditioned on historical trajectories and target returns, thereby ensuring robust behavior regularization. Additionally, we incorporate Q-learning penalization into the training process to optimize the policy by maximizing the expected reward and minimizing the expected cost, guided by the learned Q-networks. Theoretical analysis demonstrates the advantages of our approach by showing that the learned policy aligns with the optimal policy under mild assumptions. Experimental results across 38 tasks further validate the effectiveness of the QPT framework, showing its ability to learn adaptive, safe, robust, and high-reward policies. Notably, QPT consistently outperforms strong safe offline RL baselines by a significant margin across all tasks. Furthermore, it adapts zero-shot to varying constraint thresholds, making it particularly well-suited for real-world RL scenarios that operate under constraints.
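To make the described objective concrete, below is a minimal sketch (not the authors' implementation) of what a QPT-style training step could look like: a return- and cost-conditioned sequence policy trained with a behavior-regularization loss plus Q-learning penalties from separately learned reward and cost critics. All module names, network shapes, and the coefficients `alpha_r` / `alpha_c` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequencePolicy(nn.Module):
    """Toy stand-in for a return-conditioned transformer policy.

    Conditions on (state, target reward-to-go, target cost-to-go); a real
    implementation would attend over the full trajectory history.
    """
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, states, reward_to_go, cost_to_go):
        x = torch.cat([states, reward_to_go, cost_to_go], dim=-1)
        return self.net(x)

def qpt_style_loss(policy, q_reward, q_cost, batch, alpha_r=1.0, alpha_c=1.0):
    """Sequence-modeling loss plus Q penalties: imitate dataset actions,
    push predicted actions toward high reward-Q and low cost-Q."""
    pred_actions = policy(batch["states"], batch["rtg"], batch["ctg"])
    bc_loss = ((pred_actions - batch["actions"]) ** 2).mean()      # behavior regularization
    reward_term = q_reward(batch["states"], pred_actions).mean()   # maximize expected reward
    cost_term = q_cost(batch["states"], pred_actions).mean()       # minimize expected cost
    return bc_loss - alpha_r * reward_term + alpha_c * cost_term

# Example usage with random data; the Q-networks stand in for frozen critics
# that would be trained separately on the offline dataset.
state_dim, action_dim, B = 8, 3, 32
policy = SequencePolicy(state_dim, action_dim)
q_r_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_c_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_r = lambda s, a: q_r_net(torch.cat([s, a], dim=-1))
q_c = lambda s, a: q_c_net(torch.cat([s, a], dim=-1))

batch = {
    "states": torch.randn(B, state_dim),
    "actions": torch.randn(B, action_dim),
    "rtg": torch.randn(B, 1),  # target reward-to-go
    "ctg": torch.randn(B, 1),  # target cost-to-go (set from the constraint threshold at test time)
}
loss = qpt_style_loss(policy, q_r, q_c, batch)
loss.backward()
```

Conditioning on a target cost-to-go at inference time is one plausible way the zero-shot adaptation to varying constraint thresholds mentioned in the abstract could be realized; the exact conditioning scheme is an assumption here.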
Primary Area: reinforcement learning
Submission Number: 16459