Single-Step Initialization for Exploratory Parallel Rollouts in Diffusion LLMs
Keywords: Diffusion language models, parallel decoding, rollout generation, policy optimization, GRPO, exploration, rollout diversity, early branching, initialization.
TL;DR: Training-free parallel decoding accelerates dLLM rollouts, but confidence-based selection delays branching. A single random initialization step restores exploration and improves diversity, reasoning potential, and RL performance.
Abstract: We propose training-free parallel decoding for rollout generation in diffusion large language model (dLLM) policy optimization, reducing rollout cost without auxiliary models or policy modification.
We find, however, that confidence-based decoding suffers from delayed branching, and parallel decoding largely inherits this characteristic.
Rollouts agree on both unmasked tokens and positions for much of generation, leading to a lack of exploration that weakens the group-relative learning signal.
We address this with a minimal initialization step in which each rollout independently unmasks one uniformly random position, after which the original sampler resumes unchanged.
The intervention is drop-in compatible with any sampling strategy.
Combined with Fast-dLLM on LLaDA-8B-Instruct, it improves rollout diversity and yields stronger downstream RL performance on GSM8K and MATH-500.
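The initialization step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: sequences are plain Python lists, the mask sentinel `MASK`, the function name `single_step_init`, and the callable `sample_token` (standing in for whatever token choice the policy makes at the selected position) are all hypothetical. The abstract specifies only that the position is drawn uniformly at random and independently per rollout.

```python
import random
from typing import Callable, List, Optional

MASK = None  # hypothetical sentinel for a still-masked position

Sequence = List[Optional[int]]


def single_step_init(
    rollouts: List[Sequence],
    sample_token: Callable[[Sequence, int], int],
    rng: random.Random,
) -> List[int]:
    """Unmask one uniformly random masked position per rollout, in place.

    Returns the position chosen for each rollout, or -1 if the rollout
    had no masked position left.
    """
    chosen = []
    for seq in rollouts:
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            chosen.append(-1)
            continue
        # Each rollout draws its own position, independent of the others
        # and independent of model confidence.
        pos = rng.choice(masked)
        # Assumption: the token itself comes from the policy at that position.
        seq[pos] = sample_token(seq, pos)
        chosen.append(pos)
    return chosen


# Toy usage: a group of 8 fully masked rollouts of length 16, with a
# dummy token chooser in place of a real model call.
rng = random.Random(0)
group = [[MASK] * 16 for _ in range(8)]
positions = single_step_init(group, lambda seq, pos: 7, rng)
```

After this call, each rollout in the group would be handed to the unmodified sampler (for example, a confidence-based parallel decoder), which continues from a different partially unmasked state per rollout.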
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 262