Unifying Autoregressive and Discrete Diffusion Language Modeling via Cross-Regressive Decoding

Published: 02 Mar 2026, Last Modified: 02 Apr 2026
ReALM-GEN 2026 - ICLR 2026 Workshop
License: CC BY 4.0
Keywords: generative models, energy-based, sequence modeling, non-autoregressive, MPC, discrete diffusion, language modeling, control theory, speculative decoding
TL;DR: We introduce "cross-regression," a semi-autoregressive framework that exploits the future-token information already latent in a pretrained model to enable simpler parallel decoding, significantly accelerating text-generation inference
Abstract: Inference acceleration can unintentionally change model behavior, complicating alignment-sensitive deployments where post-training (e.g., RLHF) should be preserved. We introduce $\textbf{Cross-Regression}$, a decoding-time method that accelerates generation while providing an explicit mechanism to preserve or relax distributional fidelity. Cross-Regression augments a pretrained autoregressive transformer with a dual-stream design: a frozen control stream computes exact next-token probabilities, and a predictive stream proposes multi-token drafts in parallel. An energy-based acceptance test, derived from the per-token log-probability ratio between the control and predictive streams, determines how many proposed tokens can be safely committed. The same test exposes an explicit trade-off between $\textit{lossless sampling}$ and a faster $\textit{lossy regime}$ with controllable deviation. Across models from 1.5B to 70B parameters, we observe strong scaling of acceptance length and realize $3$–$6\times$ speedups with near-complete quality retention across reasoning, code, and dialogue benchmarks, and we demonstrate modality transfer by accelerating Whisper decoding.
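The abstract specifies the acceptance test only as a per-token log-probability ratio between the control and predictive streams, with a knob spanning lossless to lossy decoding. Below is a minimal Python sketch of one such rule, assuming the standard speculative-sampling acceptance probability $\min(1, p_{\text{ctrl}}/p_{\text{draft}})$ as the lossless endpoint; every name here (`accept_drafts`, `slack`) is hypothetical rather than taken from the paper, and the residual-resampling step at the first rejection is omitted.

```python
import math
import random

def accept_drafts(draft_tokens, control_logprobs, draft_logprobs, slack=0.0):
    """Energy-based acceptance test (illustrative sketch, not the paper's code).

    draft_tokens:     tokens proposed in parallel by the predictive stream
    control_logprobs: log p_ctrl(t | prefix) for each drafted token, scored
                      in one pass by the frozen control stream
    draft_logprobs:   log p_draft(t | prefix) under the predictive stream
    slack:            0.0 recovers the standard lossless speculative-sampling
                      test; slack > 0.0 relaxes it into the lossy regime
    """
    committed = []
    for tok, ctrl_lp, draft_lp in zip(draft_tokens, control_logprobs, draft_logprobs):
        # Per-token energy: how much more the predictive stream favors this
        # token than the frozen control stream does.
        energy = draft_lp - ctrl_lp
        # Accept outright when the control stream agrees (within the slack);
        # otherwise accept stochastically with probability exp(slack - energy),
        # i.e. min(1, p_ctrl / p_draft) when slack == 0.
        if energy <= slack or random.random() < math.exp(slack - energy):
            committed.append(tok)
        else:
            break  # reject; caller resamples this position from the control stream
    return committed
```

Under these assumptions, slack = 0 commits tokens distributed as if sampled from the control stream alone (given residual resampling at the first rejection, as in standard speculative sampling), while raising slack lengthens accepted runs at the cost of controlled deviation.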
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 72