Keywords: offline RL, safe RL
Abstract: This paper addresses the problem of safe offline reinforcement learning, which involves training a policy to satisfy safety constraints using only an offline dataset. This problem is inherently challenging as it requires balancing three highly interconnected and competing objectives: satisfying safety constraints, maximizing rewards, and adhering to the behavior regularization imposed by the offline dataset. To tackle this three-way challenge, we propose a novel framework, the Q-learning Penalized Transformer policy (QPT). Specifically, QPT adopts a sequence modeling paradigm, learning the action distribution conditioned on historical trajectories and target returns, thereby ensuring robust behavior regularization. Additionally, we incorporate Q-learning penalization into the training process to optimize the policy by maximizing the expected reward and minimizing the expected cost, guided by the learned Q-networks. Theoretical analysis demonstrates the advantages of our approach by showing that the learned policy aligns with the optimal policy under mild assumptions. Experimental results across 38 tasks further validate the effectiveness of the QPT framework, showing its ability to learn adaptive, safe, robust, and high-reward policies. Notably, QPT consistently outperforms strong safe offline RL baselines by a significant margin across all tasks. Furthermore, it adapts zero-shot to varying constraint thresholds, making it particularly well-suited for real-world RL scenarios that operate under constraints.
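To make the described objective concrete, below is a minimal sketch (not the authors' implementation) of what a QPT-style training step could look like: a return- and cost-conditioned sequence policy trained with a behavior-regularization loss plus Q-learning penalties from separately learned reward and cost critics. All module names, network shapes, and the coefficients `alpha_r` / `alpha_c` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequencePolicy(nn.Module):
    """Toy stand-in for a return-conditioned transformer policy.

    Conditions on (state, target reward-to-go, target cost-to-go); a real
    implementation would attend over the full trajectory history.
    """
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, states, reward_to_go, cost_to_go):
        x = torch.cat([states, reward_to_go, cost_to_go], dim=-1)
        return self.net(x)

def qpt_style_loss(policy, q_reward, q_cost, batch, alpha_r=1.0, alpha_c=1.0):
    """Sequence-modeling loss plus Q penalties: imitate dataset actions,
    push predicted actions toward high reward-Q and low cost-Q."""
    pred_actions = policy(batch["states"], batch["rtg"], batch["ctg"])
    bc_loss = ((pred_actions - batch["actions"]) ** 2).mean()      # behavior regularization
    reward_term = q_reward(batch["states"], pred_actions).mean()   # maximize expected reward
    cost_term = q_cost(batch["states"], pred_actions).mean()       # minimize expected cost
    return bc_loss - alpha_r * reward_term + alpha_c * cost_term

# Example usage with random data; the Q-networks stand in for frozen critics
# that would be trained separately on the offline dataset.
state_dim, action_dim, B = 8, 3, 32
policy = SequencePolicy(state_dim, action_dim)
q_r_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_c_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_r = lambda s, a: q_r_net(torch.cat([s, a], dim=-1))
q_c = lambda s, a: q_c_net(torch.cat([s, a], dim=-1))

batch = {
    "states": torch.randn(B, state_dim),
    "actions": torch.randn(B, action_dim),
    "rtg": torch.randn(B, 1),  # target reward-to-go
    "ctg": torch.randn(B, 1),  # target cost-to-go (set from the constraint threshold at test time)
}
loss = qpt_style_loss(policy, q_r, q_c, batch)
loss.backward()
```

Conditioning on a target cost-to-go at inference time is one plausible way the zero-shot adaptation to varying constraint thresholds mentioned in the abstract could be realized; the exact conditioning scheme is an assumption here.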
Primary Area: reinforcement learning
Submission Number: 16459