Keywords: discrete diffusion, unified model
Abstract: Unified generation models aim to handle diverse tasks across modalities, such as text-to-image and image-to-text generation, within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, while non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast, parallel generation across both the text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance in both quality and efficiency compared to existing unified models. This work also highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
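To make the "fast and parallel generation" claim concrete, below is a minimal sketch of mask-predict style parallel decoding, the decoding family that discrete diffusion transformers like Muddit belong to. All names here (`model`, `mask_id`, the cosine unmasking schedule) are illustrative assumptions, not Muddit's actual API or training recipe.

```python
import math
import torch

def parallel_decode(model, tokens, mask_id, num_steps=8):
    """Illustrative mask-predict loop: every masked position is predicted
    in parallel each step, and low-confidence predictions are re-masked,
    so a full sequence is produced in num_steps forward passes instead of
    one pass per token as in autoregressive decoding."""
    batch, seq_len = tokens.shape
    for step in range(num_steps):
        logits = model(tokens)                 # (B, L, V): predict all positions at once
        probs = logits.softmax(dim=-1)
        conf, preds = probs.max(dim=-1)        # per-position confidence and argmax token
        still_masked = tokens.eq(mask_id)
        # Already-committed positions keep their tokens and never get re-masked.
        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
        preds = torch.where(still_masked, preds, tokens)
        # Cosine schedule (assumed): number of positions left masked after this step.
        n_mask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / num_steps))
        if n_mask > 0:
            # Re-mask the n_mask least confident positions for the next step.
            idx = conf.argsort(dim=-1)[:, :n_mask]
            preds.scatter_(-1, idx, mask_id)
        tokens = preds
    return tokens
```

Because the schedule reaches zero masked positions at the final step, the whole sequence is committed after `num_steps` passes; this is the source of the efficiency advantage over sequential autoregressive decoding.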
Supplementary Material: zip
Primary Area: generative models
Submission Number: 18915