EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Published: 01 Jun 2026, Last Modified: 11 Jun 2026
Venue: AdaptFM Poster
License: CC BY 4.0
Keywords: Reinforcement Learning, Large Language Models, Speculative Decoding, Rollout Acceleration, Self-Speculative Decoding, Quantization, ML Systems
TL;DR: EfficientRollout accelerates RL rollout generation by combining quantized self-speculative decoding with system-aware toggling and adaptive draft lengths, reducing rollout latency without changing training dynamics.
Abstract: Reinforcement learning (RL) has become a representative post-training paradigm for large language models (LLMs), but rollout generation remains a dominant latency bottleneck. Autoregressive (AR) sampling decodes responses sequentially, and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) is a well-established technique for serving fixed LLMs that reduces latency by drafting tokens and verifying them in parallel. However, its practical speedups do not directly carry over to RL rollouts: (i) the continuously evolving target model makes static drafters stale, and (ii) active batch sizes shrink throughout rollout decoding, changing whether verification overhead can be amortized. We present EfficientRollout, a system-aware self-SD framework for RL rollouts that induces a quantized drafter directly from the target model (i.e., self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle with acceptance-aware draft-length adaptation, speculating only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline while preserving model quality. Code is available at https://github.com/furiosa-ai/EfficientRollout.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 118
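
Illustrative sketch (not from the paper): the abstract describes toggling speculative decoding on or off depending on system conditions and adapting the draft length to the observed acceptance rate. The snippet below is a minimal toy model of that trade-off, not EfficientRollout's actual policy. It relies on the standard simplifying assumption that each drafted token is accepted independently with probability `alpha`, in which case a step that drafts `gamma` tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation. All function names, the cost units, and the `verify_cost_fn` callback are hypothetical and introduced only for illustration.

```python
from typing import Callable, Tuple


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted by one draft-then-verify step.

    Assumes (hypothetically) that each drafted token is accepted
    independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int,
                      draft_cost: float, verify_cost: float) -> float:
    """Estimated speedup over plain AR decoding.

    Costs are expressed in units of one AR step of the target model,
    so AR decoding produces exactly one token per unit of cost.
    """
    step_cost = gamma * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, gamma) / step_cost


def choose_draft_length(alpha: float, draft_cost: float,
                        verify_cost_fn: Callable[[int], float],
                        max_gamma: int = 8) -> Tuple[int, float]:
    """Pick the draft length with the best estimated speedup.

    verify_cost_fn(gamma) is a hypothetical stand-in for a system-aware
    cost model (e.g. one that depends on the active batch size).
    Returns (0, 1.0) when no draft length beats AR, i.e. SD is toggled off.
    """
    best_gamma, best_speedup = 0, 1.0
    for gamma in range(1, max_gamma + 1):
        speedup = estimated_speedup(alpha, gamma, draft_cost,
                                    verify_cost_fn(gamma))
        if speedup > best_speedup:
            best_gamma, best_speedup = gamma, speedup
    return best_gamma, best_speedup
```

In this toy model, a regime where verification is roughly as cheap as one AR step (for example `verify_cost_fn = lambda g: 1.0`, loosely mimicking a small, memory-bound batch) favors a short nonzero draft length, while a regime where verification cost grows with the number of verified tokens (for example `lambda g: g + 1.0`, loosely mimicking a large, compute-bound batch) makes the function fall back to AR decoding. The real system's cost model and adaptation rule may differ substantially.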