Trajectory Generation for Offline-to-Online Reinforcement Learning via Entropy Perspective

17 Sept 2025 (modified: 25 Sept 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Offline to Online Reinforcement Learning
Abstract: Offline reinforcement learning can learn strong policies without environment interaction, but the performance of these policies is often limited when they are deployed online, where out-of-distribution (OOD) states and actions induce Q-function overestimation. Recent offline-to-online (O2O) methods alleviate OOD overestimation, yet they struggle with a central challenge: how to allocate limited online exploration effectively. We introduce a generative data augmentation framework that directs exploration and synthesis toward the most beneficial regions of the state–action space. Our approach quantifies underfitting, rather than OOD status, through predictive uncertainty (measured via ensemble entropy or a single-model proxy) and uses a small budget of online interaction to collect high-uncertainty transitions. These uncertainty-targeted samples then condition a trajectory generator, which produces synthetic data to cover underrepresented but task-relevant regions. The policy is finetuned on a mixture of offline data, few-shot online samples, and guided synthetic trajectories, thereby improving coverage and value estimation without requiring extensive exploration. On D4RL locomotion benchmarks, our method consistently surpasses offline-only baselines and O2O finetuning without generative guidance, achieving higher normalized returns under equal or smaller online budgets. Ablation studies further demonstrate the importance of uncertainty calibration, generator guidance, and real–synthetic data balance, highlighting uncertainty-guided generation as an effective remedy for OOD overestimation and exploration inefficiency.
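The abstract does not say how ensemble entropy is computed or how high-uncertainty transitions are chosen. As a minimal sketch only, assuming a Q-function ensemble whose per-transition predictions are treated as Gaussian, the entropy can be approximated from the ensemble variance and the top-scoring transitions kept within a fixed budget. The function names, the Gaussian assumption, and the toy data below are illustrative and not taken from the paper.

```python
import numpy as np


def gaussian_entropy_from_ensemble(q_values, eps=1e-8):
    """Approximate per-transition predictive entropy.

    q_values: array of shape (n_models, n_transitions) holding each
    ensemble member's Q estimate. Assumption (not from the paper):
    the ensemble spread is modeled as a Gaussian, whose differential
    entropy is 0.5 * log(2 * pi * e * variance).
    """
    var = q_values.var(axis=0) + eps  # eps avoids log(0)
    return 0.5 * np.log(2.0 * np.pi * np.e * var)


def select_high_uncertainty(q_values, budget):
    """Return indices of the `budget` most uncertain transitions,
    ordered from highest to lowest entropy, plus all entropies."""
    entropy = gaussian_entropy_from_ensemble(q_values)
    top = np.argsort(entropy)[::-1][:budget]
    return top, entropy


# Toy data: 5 ensemble members, 8 candidate transitions.
rng = np.random.default_rng(0)
n_models, n_transitions = 5, 8
q = np.tile(np.linspace(0.0, 1.0, n_transitions), (n_models, 1))
q += 0.01 * rng.normal(size=q.shape)   # members mostly agree
q[:, 3] += np.arange(n_models)         # strong disagreement on index 3

idx, ent = select_high_uncertainty(q, budget=2)
print("selected transitions:", idx)
```

In this toy setup the ensemble disagrees most on transition 3, so it is ranked first. A single-model proxy, which the abstract also mentions, would replace the variance computation with whatever uncertainty score that model provides.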
Primary Area: reinforcement learning
Submission Number: 8504