Keywords: Training dynamics, Diffusion language models, Adaptive computation, Certifiable inference, Early exit, Representation stability
TL;DR: TRACE explains why and certifies when diffusion LLMs can safely stop denoising early, reusing lightweight training-dynamics signals to save 11–68% of steps with no accuracy loss.
Abstract: Supervised fine-tuning of diffusion language models induces structured neural representations that persist after training and can guide inference. We show that optimization dynamics leave behind actionable signals: aggregating AdamW moment trajectories on Low-Rank Adaptation (LoRA) parameters yields a Reasoning Representation Map (RRM), and monitoring its alignment with token activations defines a Representational Alignment Distribution (RAD).
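The abstract does not specify the exact construction of the RRM or the RAD, but the idea can be sketched as follows: aggregate optimizer-moment snapshots over LoRA parameters into a single map, then score its alignment with token activations. All function names and the choices of mean aggregation and cosine similarity are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def reasoning_representation_map(moment_snapshots):
    # Hypothetical aggregation: average AdamW moment snapshots taken on
    # LoRA parameters into one vector per hidden dimension. The paper's
    # exact aggregation rule is not given in the abstract.
    return np.mean(np.stack(moment_snapshots), axis=0)

def rad_score(rrm, activation):
    # Representational alignment scored as cosine similarity
    # (an illustrative choice of alignment measure).
    denom = np.linalg.norm(rrm) * np.linalg.norm(activation) + 1e-12
    return float(np.dot(rrm, activation) / denom)
```

Monitoring `rad_score` across denoising steps would then yield the distribution (RAD) whose stability the later paragraphs reason about.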
Our central contribution is to explain not merely that early termination is possible, but why it is safe. We prove that small matched-support Kullback–Leibler divergence across consecutive denoising steps bounds multi-step total variation drift, yielding certificates for stability and a no-flip guarantee for predicted tokens. Under mild contraction assumptions, these local guarantees extend globally, showing that RAD stability directly governs inference stability.
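One standard way such a certificate can be instantiated (a minimal sketch, not the paper's proof) is via Pinsker's inequality, which converts each per-step KL divergence into a total-variation bound, chained by the triangle inequality; a no-flip check then compares the top-1 probability margin against the accumulated drift. The function names and the use of Pinsker specifically are assumptions for illustration.

```python
import math

def multi_step_tv_bound(step_kls):
    # Pinsker's inequality: TV(p, q) <= sqrt(KL(p || q) / 2).
    # The triangle inequality chains per-step bounds into a bound
    # on total drift across all covered denoising steps.
    return sum(math.sqrt(kl / 2.0) for kl in step_kls)

def no_flip_certified(token_probs, step_kls):
    # If the top-1 margin exceeds twice the accumulated TV bound,
    # the argmax token cannot change over the covered steps.
    top = sorted(token_probs, reverse=True)
    margin = top[0] - top[1]
    return margin > 2.0 * multi_step_tv_bound(step_kls)
```

Under this sketch, small matched-support KL at every step keeps the TV budget small, which is exactly the sense in which RAD stability would govern inference stability.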
Building on this theory, we present the Training-Refined Adaptive Computation Exit (TRACE) algorithm, which halts generation once RAD stability persists. TRACE reuses lightweight optimizer metadata, requires no retraining or architectural changes, and consistently reduces denoising steps while preserving accuracy. More importantly, it provides the first principled account of why training dynamics encode certifiable inference signals.
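The halting rule described here can be sketched as a patience-style loop: generation stops once the RAD stability signal stays below a threshold for several consecutive steps. The interface (`denoise_step`, `rad_instability`) and the threshold/patience values are hypothetical, since the abstract does not specify them.

```python
def trace_early_exit(denoise_step, rad_instability, x, max_steps,
                     tau=0.01, patience=3):
    # Halt once the RAD instability signal stays below tau for
    # `patience` consecutive denoising steps (illustrative values).
    stable = 0
    for t in range(max_steps):
        x = denoise_step(x, t)
        stable = stable + 1 if rad_instability(x, t) < tau else 0
        if stable >= patience:
            return x, t + 1  # number of denoising steps actually used
    return x, max_steps
```

Because the signal is read from optimizer metadata already produced during fine-tuning, this loop adds no retraining and no architectural change, matching the paragraph's claim.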
Our results demonstrate that optimizer states, often discarded after training, encode interpretable structure that links training and inference, enabling adaptive computation with provable safety.
Primary Area: learning theory
Submission Number: 2522