Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Abstract: We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence score of the unmasked tokens. We further reduce the softmax overhead of token probability computation by dynamically restricting sampling to a subset of the vocabulary. CadLLM is plug-and-play and model-agnostic, compatible with KV-caching-based dLLMs. Extensive experiments on four popular tasks demonstrate the efficacy of CadLLM, yielding throughput improvements of up to 2.28× over the state-of-the-art baseline with competitive accuracy.