Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Abstract: We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence score of the unmasked tokens. We further reduce the softmax overhead of token probability computation by dynamically restricting sampling to a subset of the vocabulary. CadLLM is plug-and-play and model-agnostic, compatible with KV-caching-based dLLMs. Extensive experiments on four popular tasks demonstrate the efficacy of CadLLM, yielding throughput improvements of up to 2.28× over the state-of-the-art baseline with competitive accuracy.