Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Benjamin Rozonoyer; Jacopo Minniti; Dhruvesh Patel; Neil Band; Joey Bose; Tim G. J. Rudner; Andrew McCallum

Published: 30 May 2026, Last Modified: 01 Jun 2026. Venue: SPIGM @ ICML Poster. License: CC BY 4.0

Keywords: discrete diffusion models, masked language models, backpropagation through time, learned relay representations

TL;DR: Introduces a continuous latent channel in discrete diffusion models trained using truncated backpropagation through time.

Abstract: When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded after each step, forcing every subsequent refinement step to recompute the valuable information held in the model's representations. To avoid this hard reset between denoising rounds, we propose Learned Relay Representations (RELAY), a method that allows MDMs to be "forward-thinking" when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. RELAY introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs) and is compatible with techniques such as block diffusion and KV caching. We first provide a thorough justification of the design choices in RELAY on a challenging Sudoku-based planning task. We then scale RELAY to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks by up to 3.7% in accuracy and 32% in inference latency. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. Anonymized code for all our experiments is available at: https://anonymous.4open.science/r/relay-1D24
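
To make the idea of a per-token channel carried between forward passes concrete, the following is a minimal decoding-time sketch in plain Python. It is not the authors' implementation: `toy_denoiser`, `decode_with_relay`, the `MASK` id, the relay dimension, and the confidence-based unmasking rule are all hypothetical stand-ins chosen for illustration, and training with truncated BPTT is not shown. The sketch also assumes the relay is updated at every position on every pass, which the abstract does not specify.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the paper)


def toy_denoiser(tokens, relay, rng):
    """Stand-in for one forward pass of a masked diffusion model.

    Returns per-position predictions, confidences, and an updated relay.
    The arithmetic is arbitrary; only the data flow matters here.
    """
    preds, confs, new_relay = [], [], []
    for tok, r in zip(tokens, relay):
        # Toy "hidden state": mixes the incoming relay with fresh computation.
        h = [0.5 * x + 0.5 * rng.random() for x in r]
        new_relay.append(h)
        # Committed tokens are kept; masked ones get a toy prediction in 0..4.
        preds.append(tok if tok != MASK else int(sum(h) * 10) % 5)
        confs.append(1.0 if tok != MASK else sum(h) / len(h))
    return preds, confs, new_relay


def decode_with_relay(length=8, relay_dim=4, per_step=2, seed=0):
    """Iteratively unmask tokens while carrying a per-token relay state."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    # One relay vector per position, initialised to zeros (assumption).
    relay = [[0.0] * relay_dim for _ in range(length)]
    steps = 0
    while MASK in tokens:
        # The relay from the previous pass is an input to the next pass,
        # so internal state is not thrown away between denoising rounds.
        preds, confs, relay = toy_denoiser(tokens, relay, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: confs[i], reverse=True)
        for i in masked[:per_step]:  # commit the most confident positions
            tokens[i] = preds[i]
        steps += 1
    return tokens, relay, steps
```

Calling `decode_with_relay()` returns the completed token list, the final relay state, and the number of denoising passes. In a trained model the relay would be a learned, differentiable representation, and gradients would flow through a limited number of consecutive passes (the truncation window) rather than through a hand-written mixing rule.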

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 194
