Keywords: Reinforcement learning, benchmark, large language model
Abstract: A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can help mitigate this limitation by generating synthetic experience (referred to as imaginary rollouts) for learning novel tasks, the absence of a standardized benchmark hinders progress in this emerging area. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that learn from both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts with verified quality; (2) diverse domains covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions of varying complexity to support instruction-following policy learning. Through comprehensive experiments, we find that simply applying existing offline RL algorithms yields suboptimal generalization, achieving only 35.44% task completion on unseen tasks compared to 64.37% for policies trained on real data. Moreover, performance varies with instruction complexity, confirming that ImagineBench provides a meaningful spectrum of task difficulty. Furthermore, we show that pre-training with imaginary rollouts leads to superior asymptotic performance after online fine-tuning. Based on these findings, ImagineBench identifies key directions for future research, including improved exploitation of imaginary rollouts, efficient online adaptation, continual learning, and extension to multi-modal task settings. Our code is available at https://anonymous.4open.science/r/Imagine_Bench_anonymous-40CD
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 545