LightningRL: Breaking the Accuracy–Parallelism Trade-off of Block-wise dLLMs via Reinforcement Learning

Yanzhe Hu; Yijie Jin; Pengfei Liu; Kai Yu; Zhijie Deng

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, License: CC BY 4.0

TL;DR: An RL framework for dLLMs that boosts reasoning accuracy without sacrificing parallel speed, optimizing the accuracy–parallelism trade-off via trajectory-aware credit assignment.

Abstract: Diffusion Large Language Models (dLLMs) enable parallel token generation, and their block-wise variants have attracted significant attention. However, existing dLLMs usually exhibit an accuracy–parallelism trade-off: raising tokens per forward pass (TPF) via aggressive parallel decoding often degrades task accuracy. To address this, we propose a post-training approach that directly optimizes the speed–quality frontier of pre-trained dLLMs. Conceptually, we do not require the model to decode aggressively along all sampling trajectories, but rather to find several highly parallelizable ones that yield correct results. To this end, we adopt a reinforcement learning paradigm, LightningRL, that optimizes rewards for both final accuracy and inference parallelism. LightningRL follows the Group Relative Policy Optimization (GRPO) framework, with further improvements for dLLMs: 1) stabilized training via per-reward decoupled normalization, 2) a token-level negative log-likelihood (NLL) loss on correct trajectories for regularization, and 3) improved training efficiency through dynamic sampling with TPF-aware filtering. Across math and code tasks, LightningRL consistently advances the Pareto frontier, maintaining competitive accuracy while increasing parallelism to an average TPF of 7.3 (up to 11.10 on MBPP).
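To make the TPF metric and the idea of per-reward decoupled normalization concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the z-score normalization within a rollout group, the weights `w_acc` and `w_tpf`, and all function names are illustrative assumptions. The sketch treats TPF as generated tokens divided by forward passes, normalizes the accuracy reward and the TPF reward separately within one group, and then combines them into a single advantage per rollout.

```python
from statistics import mean, pstdev


def tokens_per_forward(num_tokens: int, num_forward_passes: int) -> float:
    """TPF: number of generated tokens divided by forward passes used."""
    if num_forward_passes <= 0:
        raise ValueError("num_forward_passes must be positive")
    return num_tokens / num_forward_passes


def group_normalize(values, eps=1e-6):
    """Z-score rewards within one rollout group (GRPO-style baseline)."""
    mu = mean(values)
    sigma = pstdev(values)  # population std over the group
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(acc_rewards, tpf_rewards, w_acc=1.0, w_tpf=1.0):
    """Assumed form of decoupled normalization: normalize each reward
    separately, then sum, so neither reward's scale dominates the other."""
    if len(acc_rewards) != len(tpf_rewards):
        raise ValueError("reward lists must have the same length")
    acc_adv = group_normalize(acc_rewards)
    tpf_adv = group_normalize(tpf_rewards)
    return [w_acc * a + w_tpf * t for a, t in zip(acc_adv, tpf_adv)]


# Hypothetical group of four rollouts for one prompt, 256 tokens each.
forward_passes = [32, 64, 128, 32]
tpf = [tokens_per_forward(256, n) for n in forward_passes]
accuracy = [1.0, 0.0, 1.0, 0.0]  # 1.0 = correct final answer
advantages = decoupled_advantages(accuracy, tpf)
print(advantages)
```

Under these assumptions, the rollout that is both correct and highly parallel receives the largest advantage, which matches the stated goal of rewarding trajectories that are fast and correct rather than forcing aggressive decoding everywhere.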

Lay Summary: Modern language models usually generate text one small piece at a time, which can make them slow when producing long answers. A newer family of models can instead fill in several pieces of text at once, much like completing multiple blanks in a sentence. This parallel generation has the potential to make AI systems faster, but in practice it often comes with a cost: when the model tries to generate too much at once, its answers become less accurate. This paper introduces LightningRL, a training method that helps such models find a better balance between speed and quality. Rather than forcing the model to always generate as many pieces as possible at once, LightningRL rewards generation paths that are both fast and correct. It also includes safeguards that keep the model from becoming fast in ways that harm answer quality. We test LightningRL on math reasoning and programming tasks. The results show that it can substantially increase the amount of text generated in parallel while maintaining competitive, and sometimes improved, accuracy. This suggests that future language models may be able to produce reliable answers more quickly, which could make AI systems more efficient and practical to use.

Link To Code: https://github.com/SJTU-DENG-Lab/LightningRL

Primary Area: Reinforcement Learning->Deep RL

Keywords: Diffusion Large Language Models, Reinforcement Learning, Inference Acceleration

Submission Number: 15359