From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Release for Offline-to-Online RL
Keywords: Reinforcement Learning, Offline-to-Online RL, Energy-Guided Diffusion Model
Abstract: Offline-to-online reinforcement learning (off2on RL) integrates the sample efficiency of offline pretraining with the adaptability of online fine-tuning. However, it suffers from a constraint-release dilemma: conservative objectives inherited from offline training ensure stability yet hinder adaptation, while uniformly discarding them induces instability. Existing approaches impose global constraints across all samples, thereby overlooking the distributional heterogeneity that arises as offline and online data gradually overlap. We propose Dynamic Alignment for RElease (DARE), a distribution-aware framework that enforces constraints at the sample level in a behavior-consistent manner. To this end, DARE employs a diffusion-based behavior model with energy guidance to generate reference actions, assigns alignment scores to individual samples, leverages Gaussian fitting to distinguish offline-like from online-like data, and exchanges behavior-inconsistent samples between offline and online batches to ensure behavior-consistent constraint enforcement. We theoretically prove that DARE reduces the offline–online distributional discrepancy while ensuring that value estimation errors remain bounded. Our empirical results on the D4RL benchmark demonstrate that integrating DARE into representative off2on methods (Cal-QL and IQL) consistently improves policy performance and achieves stable, robust, and adaptive fine-tuning. (An anonymized code archive is included in the supplementary material.)
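To make the sample-level mechanism described in the abstract concrete, below is a minimal Python sketch of the scoring-partition-exchange step. It assumes reference actions have already been produced by the energy-guided diffusion behavior model; the negative-L2 alignment score, the single-Gaussian fit per batch, the k-sigma exchange threshold, and all function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def alignment_scores(batch_actions, reference_actions):
    """Per-sample alignment score: negative L2 distance between each stored
    action and the reference action proposed by the behavior model.
    (Scoring rule is an assumed stand-in for the paper's alignment metric.)"""
    return -np.linalg.norm(batch_actions - reference_actions, axis=-1)

def fit_gaussian(scores):
    """Fit a single Gaussian (mean, std) to one batch's alignment scores."""
    return scores.mean(), scores.std() + 1e-8

def exchange_inconsistent(offline_batch, online_batch,
                          offline_scores, online_scores, k=1.0):
    """Move samples whose score deviates more than k standard deviations from
    their batch's fitted Gaussian into the other batch, so the offline batch
    (where the conservative constraint is applied) stays behavior-consistent.
    The k-sigma rule is an assumption for illustration."""
    off_mu, off_sigma = fit_gaussian(offline_scores)
    on_mu, on_sigma = fit_gaussian(online_scores)

    # Offline samples that look online-like (poorly aligned with the behavior
    # model) and online samples that look offline-like (well aligned).
    off_to_on = offline_scores < off_mu - k * off_sigma
    on_to_off = online_scores > on_mu + k * on_sigma

    new_offline = np.concatenate([offline_batch[~off_to_on], online_batch[on_to_off]])
    new_online = np.concatenate([online_batch[~on_to_off], offline_batch[off_to_on]])
    return new_offline, new_online

# Toy usage with random data in place of real transitions and reference actions.
rng = np.random.default_rng(0)
offline_batch = rng.normal(size=(256, 6))
online_batch = rng.normal(size=(256, 6))
off_scores = alignment_scores(offline_batch, rng.normal(size=(256, 6)))
on_scores = alignment_scores(online_batch, rng.normal(size=(256, 6)))
new_off, new_on = exchange_inconsistent(offline_batch, online_batch, off_scores, on_scores)
```

The conservative (offline) objective would then be applied only to `new_off`, and the unconstrained online update to `new_on`, which is one plausible reading of "behavior-consistent constraint enforcement" at the sample level.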
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15069