Provably Safe Representation Learning in CMDPs: A Primal-Dual Approach

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Representation Learning; Safe Representation Learning; Constrained Markov Decision Process
Abstract: We study representation learning in low-rank Constrained Markov Decision Processes (CMDPs) with unknown dynamics, where the agent must maximize rewards under safety constraints. While representation learning has significantly advanced for unconstrained MDPs, its extension to CMDPs remains open due to the critical challenge of safe exploration under learned features, particularly concerning the management of soft constraint violation. In this work, we propose REP-PD, the first algorithm that provably integrates representation learning with policy optimization in low-rank CMDPs. By iteratively learning a low-rank transition representation via MLE and utilizing a composite Q-function tied to the unconstrained Lagrangian, REP-PD guides policy updates to balance reward maximization, exploration, and robust constraint adherence. Through this approach, REP-PD achieves a near-optimal policy with a sampling complexity bound independent of the state space dimension without prior feature knowledge. Notably, REP-PD's regret matches the lower bounds for unconstrained low-rank MDPs, achieving strong performance concerning soft constraint violation. We then consider a stronger hard constraint violation metric, where the agent must strictly satisfy constraints at all times, and propose REP-PD-hard by designing a novel policy optimization module. Our work thus provides a robust and theoretically grounded approach to representation learning in constrained reinforcement learning, with guarantees on bounded soft and hard constraint violation.
Primary Area: reinforcement learning
Submission Number: 7104