Robust Constrained Offline Reinforcement Learning with Linear Function Approximation

Published: 23 Sept 2025, Last Modified: 01 Dec 2025
Venue: ARLET
License: CC BY 4.0
Track: Research Track
Keywords: Offline reinforcement learning, distributional robustness, constrained MDP, linear function approximation
Abstract: Bridging the sim-to-real gap requires reinforcement learning policies that achieve not only high rewards but also safety and robustness under distribution shifts. Yet the high dimensionality of the state-action space makes learning sample-inefficient. To this end, we study robust constrained linear Markov decision processes (Lin-RCMDPs) in the offline setting, where an agent seeks to maximize the expected return while satisfying safety constraints against the worst-case dynamics drawn from an ambiguity set defined by a total-variation ball. We propose CROP-VI, a sample-efficient, model-based primal-dual algorithm that integrates robust planning with rectified Lagrangian updates to ensure constraint feasibility across all transitions in the ambiguity set. Specifically, we introduce pessimism into the reward function to prevent over-estimation, and apply asymmetric optimism to the constraint to balance the exploration-exploitation trade-off. Under mild data-coverage assumptions, we establish the first instance-dependent sub-optimality bound of CROP-VI for Lin-RCMDPs, showing that the learned policy is not only feasible with respect to the worst-case model but also achieves a near-optimal robust return. We further establish sample-complexity bounds for CROP-VI under partial and full feature-coverage data, and extend the analysis beyond the linear MDP idealization to a misspecified regime, showing that performance degrades gracefully with the approximation error.
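To make the setting concrete, here is a minimal sketch of the robust constrained objective the abstract describes; the notation (reward r, constraint utility c, threshold b, nominal kernel P^o, radius ρ, horizon H, dual step size η) is assumed for illustration, and the paper's exact formulation and sign conventions may differ:

\[
\max_{\pi} \; \min_{P \in \mathcal{P}(\rho)} \; \mathbb{E}^{\pi,P}\!\Big[\sum_{h=1}^{H} r_h(s_h,a_h)\Big]
\quad \text{s.t.} \quad
\min_{P \in \mathcal{P}(\rho)} \; \mathbb{E}^{\pi,P}\!\Big[\sum_{h=1}^{H} c_h(s_h,a_h)\Big] \ge b,
\]
\[
\mathcal{P}(\rho) = \Big\{ P : \mathrm{TV}\big(P(\cdot \mid s,a),\, P^{o}(\cdot \mid s,a)\big) \le \rho \ \ \forall (s,a) \Big\}.
\]

Under this reading, a primal-dual scheme with rectified Lagrangian updates would alternate robust planning on the combined signal \(r_h + \lambda c_h\) with a projected dual step of the form \(\lambda \leftarrow \big[\lambda - \eta\,(\widehat{V}^{\pi}_{c} - b)\big]_{+}\), where \(\widehat{V}^{\pi}_{c}\) denotes the estimated robust constraint value; this is a sketch of the general technique, not the algorithm's exact update rule.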
Submission Number: 42