Keywords: Safe reinforcement learning, Multi-objective reinforcement learning, Uncertainty quantification
TL;DR: Guiding exploitation and exploration by multi-objective uncertainty in Q-values for safety-first reinforcement learning
Abstract: Using uncertainty in Q-values to mitigate overestimation, enhance exploration, and ensure safety has proven effective in single-objective deep Q-learning. However, when learning vector-valued Q-functions for correlated goals, uncertainties become intertwined across objectives. Conventional approaches either treat the uncertainty in each objective independently or collapse it into a single dimension, often resulting in unstable learning, low sample efficiency, limited exploration, and, above all, unsafe behaviours. To address these challenges, this study proposes Cholesky Ordered Projection Q-learning (COP-Q), a novel method that guides safety-first exploitation and exploration using the full multi-objective uncertainty. We first propose generalized multi-objective confidence bounds via covariance matrix factorization. For priority-ordered objectives, such as those in safety-critical or cost-constrained reinforcement learning, Cholesky factorization is employed to incorporate inter-objective covariance into the confidence bounds in a conditionally sequential manner. The lower bound yields conservative temporal-difference targets to reduce overestimation, while the upper bound assigns optimistic Q-values to promote exploration. COP-Q is evaluated on standard MuJoCo and velocity-constrained SafetyVelocity-v1 benchmarks, demonstrating robust safety performance and competitive total returns. The proposed method is compatible with various deep Q-learning frameworks at minimal computational overhead, making it practical for a wide range of multi-objective and constrained reinforcement learning tasks.
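The core idea of Cholesky-based ordered confidence bounds can be sketched as follows. This is a minimal illustration of the mechanism described in the abstract, not the paper's exact algorithm: the ensemble-based covariance estimate, the scale parameter `beta`, and the use of the Cholesky factor's diagonal as conditional standard deviations are all assumptions made for the sketch.

```python
import numpy as np

def cop_bounds(q_samples, beta=1.0, eps=1e-6):
    """Illustrative multi-objective confidence bounds via Cholesky factorization.

    q_samples: (n_samples, n_objectives) array of vector-valued Q estimates,
    e.g. from an ensemble of Q-networks (an assumed source of uncertainty).
    Objectives are assumed to be priority-ordered (safety first).
    Returns (lower, upper) per-objective confidence bounds.
    """
    mu = q_samples.mean(axis=0)                 # mean vector-valued Q estimate
    cov = np.cov(q_samples, rowvar=False)       # inter-objective covariance
    cov += eps * np.eye(cov.shape[0])           # regularize for positive definiteness
    L = np.linalg.cholesky(cov)                 # Sigma = L @ L.T
    # For priority-ordered objectives, L[i, i] is the standard deviation of
    # objective i conditioned on objectives 0..i-1, so the bounds account for
    # inter-objective covariance in a conditionally sequential manner.
    cond_std = np.diag(L)
    lower = mu - beta * cond_std                # conservative TD targets
    upper = mu + beta * cond_std                # optimistic values for exploration
    return lower, upper
```

In this reading, the lower bound would feed conservative temporal-difference targets to curb overestimation, while the upper bound would score actions optimistically during exploration.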
Primary Area: reinforcement learning
Submission Number: 24260