Keywords: Reinforcement learning, benchmark, large language model
Abstract: A central challenge in reinforcement learning (RL) is its dependence on extensive real-world interaction data to learn task-specific policies. While recent work demonstrates that large language models (LLMs) can help mitigate this limitation by generating synthetic experience (referred to as imaginary rollouts) for learning novel tasks, the absence of a standardized benchmark hinders progress in this emerging area. To bridge this gap, we introduce ImagineBench, the first comprehensive benchmark for evaluating offline RL algorithms that learn from both real rollouts and LLM-imaginary rollouts. The key features of ImagineBench include: (1) datasets comprising environment-collected and LLM-imaginary rollouts with verified quality; (2) diverse domains covering locomotion, robotic manipulation, and navigation tasks; and (3) natural language task instructions of varying complexity to support instruction-following policy learning. Through comprehensive experiments, we find that simply applying existing offline RL algorithms yields suboptimal generalization, achieving only 35.44% task completion on unseen tasks compared to 64.37% for policies trained on real data. Moreover, performance varies with instruction complexity, confirming that ImagineBench provides a meaningful spectrum of task difficulty. Furthermore, we show that pre-training with imaginary rollouts leads to superior asymptotic performance after online fine-tuning. Based on these findings, ImagineBench identifies key directions for future research, including improved exploitation of imaginary rollouts, efficient online adaptation, continual learning, and extension to multi-modal task settings. Our code is available at https://anonymous.4open.science/r/Imagine_Bench_anonymous-40CD
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 545