Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

28 Feb 2026 (modified: 14 Mar 2026). Under review for TMLR. License: CC BY 4.0

Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout (the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions) determines the data that the optimizer ultimately learns from. Yet rollout design is often treated as an implementation detail and is underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate–Filter–Control–Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular and composable stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, or critics; Control allocates compute and makes continuation, branching, and stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks and data. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes the trade-offs rollout designs must navigate. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, systems-level throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. Finally, we provide a practitioner-oriented diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.

Submission Type: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=48qVH5g7Ez

Changes Since Last Submission: Minor formatting-only updates were made to match the TMLR template. No technical content was changed.

Assigned Action Editor: ~Xinrun_Wang1

Submission Number: 7709
