Your Autoregressive Visual Model is a Natively Multi-Token Predictor: Speculative Coupled Decoding for Fast Autoregressive Visual Generation

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence (Poster)
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Autoregressive Visual Models, Speculative Decoding, Training-Free
Abstract: Autoregressive (AR) modeling has recently emerged as a promising paradigm in visual generation, but its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. While several Speculative Decoding (SD)-based methods have been proposed to address this problem by generating multiple tokens in a single forward step, they suffer from limited speedup or degraded quality, or require training a draft model. To overcome these limitations, we propose Speculative Coupled Decoding (SCD), a new training-free, lossless SD framework that extends the recently proposed Speculative Jacobi Decoding (SJD). While SJD shows strong potential for accelerating AR generation by combining Jacobi iteration with SD, we found that its acceptance rate remains significantly limited by the instability arising from the independent sampling process used during draft token generation. To address this, we introduce an information-theoretic technique, Coupling, which stabilizes the drafting trajectory of SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, significantly enhancing the acceptance rate while preserving its lossless property. Remarkably, this method can be applied to any AR model without any training or overhead, yet achieves substantial performance gains, delivering up to a 4.2× speedup in image generation and a 13.6× speedup in video generation compared to standard AR decoding, with zero quality degradation.
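The coupling idea described in the abstract can be illustrated with a maximal coupling of two categorical token distributions: two correlated draws whose marginals are exactly the given distributions, but which coincide with the highest possible probability (1 minus their total-variation distance). The sketch below is illustrative only, assuming NumPy probability vectors over a shared vocabulary; the function name `maximal_coupling` and the interface are hypothetical and not taken from the paper.

```python
import numpy as np

def maximal_coupling(p, q, rng):
    """Jointly sample (x, y) with x ~ p and y ~ q, maximizing P(x == y).

    P(x == y) = 1 - TV(p, q), the information-theoretic optimum.
    `p` and `q` are probability vectors over the same token vocabulary.
    Hypothetical sketch of the coupling principle, not the paper's code.
    """
    overlap = np.minimum(p, q)
    w = overlap.sum()  # = 1 - total-variation distance between p and q
    if rng.random() < w:
        # Shared component: both draws land on the same token.
        x = rng.choice(len(p), p=overlap / w)
        return x, x
    # Residual components: the marginals remain exactly p and q.
    x = rng.choice(len(p), p=(p - overlap) / (1 - w))
    y = rng.choice(len(q), p=(q - overlap) / (1 - w))
    return x, y
```

Applied across consecutive Jacobi iterations, such a coupling keeps a draft token unchanged whenever the model's distribution at that position barely moves, which is what raises the acceptance rate without altering the per-iteration sampling distribution.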
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 38