Divide and Conquer: Selective Value Learning and Policy Optimization for Offline Safe Reinforcement Learning
Abstract: Offline safe reinforcement learning (RL) aims to learn policies that maximize reward while satisfying safety constraints from a fixed dataset. Existing methods extend offline RL with primal–dual value learning and behavior-regularized policy optimization, but they struggle in safety-critical tasks: uniform updates across all states ignore the difference between safety-preserving and unsafe states, leading to inaccurate value estimates, infeasible solutions when constraints conflict, and strong sensitivity to dataset quality. We propose SEVPO ($\textbf{SE}$lective $\textbf{V}$alue Learning and $\textbf{P}$olicy $\textbf{O}$ptimization), a divide-and-conquer framework that separates updates based on state safety. SEVPO learns conservative cost values to identify safe states, applies reward-constrained optimization with selective regularization within them, and switches to cost minimization outside them to compute least-cost escape paths. Extensive experiments show that SEVPO achieves high reward and strict safety guarantees, outperforming state-of-the-art offline safe RL methods across diverse dataset qualities. We further validate SEVPO by training a Unitree Go2 quadruped robot in dynamic environments using only offline data, demonstrating its potential for safety-critical robotics (https://youtu.be/tDpWq2EV_Ig).
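The abstract's divide-and-conquer idea can be illustrated with a minimal sketch of a selective policy objective, assuming a learned conservative cost Q-value and a safety threshold; all names (`q_c`, `kappa`, `beta`, `sevpo_policy_loss`) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch

def sevpo_policy_loss(q_r: torch.Tensor,
                      q_c: torch.Tensor,
                      log_pi: torch.Tensor,
                      kappa: float = 0.1,
                      beta: float = 1.0) -> torch.Tensor:
    """Illustrative divide-and-conquer policy objective (not the authors' code).

    q_r:    reward Q-values of actions sampled from the current policy
    q_c:    conservative cost Q-values of the same actions
    log_pi: log-probability of dataset actions under the current policy,
            used here as a simple behavior-regularization surrogate (assumed)
    kappa:  cost threshold splitting safe from unsafe states (assumed name)
    beta:   weight of the selective behavior-regularization term (assumed name)
    """
    # Conservative cost values identify safe states.
    safe = (q_c <= kappa).float()
    # Safe states: maximize reward while staying close to dataset actions.
    loss_safe = -(q_r + beta * log_pi)
    # Unsafe states: ignore reward and minimize cost, i.e. follow a least-cost escape path.
    loss_unsafe = q_c
    return (safe * loss_safe + (1.0 - safe) * loss_unsafe).mean()
```

In such a sketch, a gradient step would back-propagate this loss through the policy that produced the sampled actions, so safe and unsafe states receive different update signals within the same batch.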
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Wilka_Torrico_Carvalho1
Submission Number: 6765