Trajectory Generation with Conservative Value Guidance for Offline Reinforcement Learning

ICLR 2026 Conference Submission 22656 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning, Sequential Decision Making
TL;DR: We propose TGCVG, a simple yet effective offline RL framework that synthesizes high-quality trajectories via conservative value-guided generation.
Abstract: Recent advances in offline reinforcement learning (RL) have produced high-performing algorithms that achieve impressive results across standard benchmarks. However, many of these methods depend on increasingly complex planning architectures, which hinder their deployment in real-world settings due to high inference costs. To overcome this limitation, recent research has explored data augmentation techniques that offload computation from online decision-making to offline data preparation. Among these, diffusion-based generative models have shown potential for synthesizing diverse trajectories but incur significant overhead in training and data generation. In this work, we propose Trajectory Generation with Conservative Value Guidance (TGCVG), a novel trajectory-level data augmentation framework that integrates a high-performing offline policy with a learned dynamics model. To ensure that the synthesized trajectories are both high-quality and close to the original dataset distribution, we introduce a value-guided regularization term during training of the offline policy. This regularization encourages conservative action selection, effectively mitigating distributional shift during trajectory synthesis. Empirical results on standard benchmarks demonstrate that TGCVG not only improves the performance of state-of-the-art offline RL algorithms but also significantly reduces training and trajectory synthesis time. These findings highlight the effectiveness of value-aware data generation in improving both efficiency and policy performance.
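The core loop the abstract describes can be sketched as follows. This is a conceptual illustration only, not the authors' implementation: the toy dynamics model, the stand-in conservative value function, and all function names (`dynamics_model`, `conservative_value`, `guided_policy`, `synthesize_trajectory`) are assumptions made for exposition.

```python
# Conceptual sketch of conservative value-guided trajectory synthesis.
# All models below are illustrative toys, not the paper's architecture.
import numpy as np

rng = np.random.default_rng(0)

def dynamics_model(state, action):
    # Stand-in for a learned dynamics model: linear transition + noise.
    return 0.9 * state + 0.1 * action + 0.01 * rng.standard_normal(state.shape)

def conservative_value(state, action):
    # Stand-in for a conservatively trained value estimate: it penalizes
    # large-magnitude (out-of-distribution-like) actions, which is the
    # qualitative effect the value-guided regularization aims for.
    return -np.sum(state**2) - 2.0 * np.sum(action**2)

def guided_policy(state, n_candidates=16):
    # Value guidance: sample candidate actions and keep the one with the
    # highest conservative value, steering synthesis toward in-distribution
    # behavior.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, state.shape[0]))
    scores = [conservative_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

def synthesize_trajectory(init_state, horizon=10):
    # Roll the guided policy through the learned dynamics model to produce
    # one synthetic trajectory of (state, action) pairs for augmentation.
    traj, state = [], init_state
    for _ in range(horizon):
        action = guided_policy(state)
        next_state = dynamics_model(state, action)
        traj.append((state, action))
        state = next_state
    return traj

trajectory = synthesize_trajectory(np.ones(3))
```

Because the rollout happens entirely offline, the augmented dataset can then be handed to any standard offline RL algorithm without adding inference-time cost, which matches the paper's motivation of offloading computation to data preparation.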
Primary Area: reinforcement learning
Submission Number: 22656