Flow-Guided Latent Refiner Policies for Safe Offline Reinforcement Learning

ICLR 2026 Conference Submission 19479 Authors

19 Sept 2025 (modified: 08 Oct 2025), CC BY 4.0
Keywords: Offline Reinforcement Learning; Safe Reinforcement Learning
Abstract: Safe offline reinforcement learning remains challenging due to two coupled obstacles: (i) reconciling soft penalty designs with hard safety requirements, and (ii) avoiding out-of-distribution (OOD) actions when the learned policy departs from the behavior data. Existing approaches often rely on penalty tuning that under- or over-regularizes safety, solve constrained objectives that depend on accurate simulators or online rollouts, or train powerful generative policies that still explore low-density, safety-unknown regions at deployment. We introduce a constraint-free offline framework that addresses both issues by (a) modeling the latent action manifold with a trainable, state-conditioned flow-based density that explicitly concentrates probability mass on high-density (and empirically safe) regions, and (b) applying a lightweight refiner stage that performs small, ordered updates in the latent space to jointly improve reward and safety before decoding actions. This design keeps policy search inside the modeled data manifold, while a feasibility-aware training signal steers the refiner toward low-violation solutions without requiring explicit constraints or online interaction. Across a range of safe offline benchmarks, the proposed method achieves lower violation rates while matching or outperforming baselines in return, demonstrating its potential as a practical and effective approach to safer offline policy learning.
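The abstract describes two components: a state-conditioned flow that models the latent action density, and a refiner that takes small latent-space steps before decoding. The sketch below is a minimal illustration of that idea, not the authors' implementation; all names (ConditionalAffineFlow, LatentRefiner, step_size, n_steps, the density-gating rule) are hypothetical placeholders chosen to make the structure concrete.

```python
# Minimal sketch, assuming a single affine flow layer and a learned, bounded refiner.
import torch
import torch.nn as nn


class ConditionalAffineFlow(nn.Module):
    """One affine flow layer z -> z * exp(s(x)) + t(x), conditioned on state x."""

    def __init__(self, state_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def log_prob(self, z, state):
        # Density of z under a standard-normal base pushed through the inverse transform.
        s, t = self.net(state).chunk(2, dim=-1)
        z0 = (z - t) * torch.exp(-s)
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(z0).sum(dim=-1) - s.sum(dim=-1)


class LatentRefiner(nn.Module):
    """Predicts small, bounded latent updates applied in an ordered sequence of steps."""

    def __init__(self, state_dim, latent_dim, step_size=0.05, n_steps=3):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim), nn.Tanh(),  # bounded update direction
        )
        self.step_size, self.n_steps = step_size, n_steps

    def forward(self, z, state, flow):
        for _ in range(self.n_steps):
            delta = self.step_size * self.step(torch.cat([z, state], dim=-1))
            z_new = z + delta
            # Illustrative gating rule: accept an update only where the flow density
            # does not drop, keeping the search inside the modeled data manifold.
            keep = (flow.log_prob(z_new, state) >= flow.log_prob(z, state)).unsqueeze(-1)
            z = torch.where(keep, z_new, z)
        return z
```

In this sketch, the refined latent would then be passed through an action decoder (not shown) to produce the executed action, and the feasibility-aware training signal mentioned in the abstract would shape the refiner's update network; both are omitted here because their exact form is not specified in the abstract.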
Primary Area: reinforcement learning
Submission Number: 19479