Flow-Guided Latent Refiner Policies for Safe Offline Reinforcement Learning

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Offline Reinforcement Learning; Safe Reinforcement Learning
Abstract: Safe offline reinforcement learning remains challenging due to two coupled obstacles: (i) reconciling soft penalty designs with hard safety requirements, and (ii) avoiding out-of-distribution (OOD) actions when the learned policy departs from the behavior data. Existing approaches often rely on penalty tuning that under- or over-regularizes safety, solve constrained objectives that depend on accurate simulators or online rollouts, or train powerful generative policies that still explore low-density regions of unknown safety at deployment. We introduce a constraint-free offline framework that addresses both issues by (a) learning a flow-based latent action manifold that concentrates density on empirically safe regions and admits tractable bounds on policy deviation and OOD shift, and (b) applying a lightweight refiner stage that performs small, ordered updates in latent space to decouple reward, safety, and OOD control, stabilizing multi-objective optimization. This design keeps policy search inside the modeled data manifold, while a feasibility-aware training signal steers the refiner toward in-support, low-violation solutions without requiring explicit constraints or online interaction. Across a range of safe offline benchmarks, the proposed method achieves lower violation rates than the baselines while matching or exceeding their returns, demonstrating its potential as a practical approach to safer offline policy learning.
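To make the two-stage design concrete, below is a minimal sketch of how such a policy could be structured, assuming a RealNVP-style conditional flow over actions and a refiner that takes a few bounded latent steps. All names and hyperparameters here (ConditionalCoupling, LatentRefiner, max_step, n_refine_steps) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a flow-guided latent refiner policy, based only on the
# high-level description in the abstract. Hypothetical architecture, not the
# paper's actual method.
import torch
import torch.nn as nn


class ConditionalCoupling(nn.Module):
    """One affine coupling layer of a conditional flow over actions."""
    def __init__(self, act_dim, state_dim, hidden=64):
        super().__init__()
        self.half = act_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (act_dim - self.half)),
        )

    def forward(self, z, s):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, s], -1)).chunk(2, -1)
        z2 = z2 * torch.exp(torch.tanh(scale)) + shift  # invertible affine map
        return torch.cat([z1, z2], -1)


class LatentRefiner(nn.Module):
    """Proposes small, ordered latent updates; the tanh clip bounds each
    step, keeping the refined latent near the prior (i.e., in-support)."""
    def __init__(self, act_dim, state_dim, max_step=0.1, hidden=64):
        super().__init__()
        self.max_step = max_step
        self.net = nn.Sequential(
            nn.Linear(act_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, z, s):
        delta = self.max_step * torch.tanh(self.net(torch.cat([z, s], -1)))
        return z + delta  # bounded update -> bounded policy deviation


def act(flow_layers, refiner, state, n_refine_steps=3):
    """Sample a base latent, refine it a few steps, decode to an action."""
    z = torch.randn(state.shape[0], ACT_DIM)
    for _ in range(n_refine_steps):
        z = refiner(z, state)
    for layer in flow_layers:
        z = layer(z, state)
    return torch.tanh(z)  # squash to the action range


ACT_DIM, STATE_DIM = 6, 17  # e.g., a MuJoCo-style locomotion task
flow = nn.ModuleList([ConditionalCoupling(ACT_DIM, STATE_DIM) for _ in range(4)])
refiner = LatentRefiner(ACT_DIM, STATE_DIM)
action = act(flow, refiner, torch.zeros(1, STATE_DIM))
```

Bounding each refiner step with a tanh clip is one simple way to keep the refined latent close to the prior, so that decoded actions stay inside the modeled manifold; this mirrors the abstract's "small, ordered updates" and its tractable bounds on policy deviation and OOD shift, though the paper's actual mechanism may differ.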
Primary Area: reinforcement learning
Submission Number: 19479