Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Minghui Hu; Chuanxia Zheng; Zuopeng Yang; Tat-Jen Cham; Heliang Zheng; Chaoyue Wang; Dacheng Tao; Ponnuthurai N. Suganthan

Published: 01 Feb 2023, Last Modified: 22 Jun 2025. Venue: ICLR 2023 poster. Readers: Everyone

Keywords: Multi-modal, Image generation, Image Caption.

Abstract: The recently developed discrete diffusion model performs extraordinarily well in generation tasks, especially in the text-to-image task, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model, which can perform text-based, image-based, and even vision-language simultaneous generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which is vital for multimodal generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
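The idea of a unified Markov transition matrix can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes a joint state space of image tokens, text tokens, and one shared [MASK] state, where a token is kept with probability `alpha`, masked with probability `gamma`, and otherwise resampled uniformly within its own modality, never across modalities. The function names, parameter names, and this block layout are hypothetical and are not stated in the abstract.

```python
import random


def unified_transition_matrix(num_image_tokens, num_text_tokens, alpha, gamma):
    """Row-stochastic matrix q[i][j] = P(next state = j | current state = i).

    Assumed state layout (hypothetical): image tokens first, then text
    tokens, then a single shared [MASK] state as the last index.
    """
    k_img, k_txt = num_image_tokens, num_text_tokens
    mask = k_img + k_txt          # index of the shared [MASK] state
    size = mask + 1
    # Mass left after "keep" (alpha) and "mask" (gamma) is spread uniformly
    # over the tokens of the same modality only.
    beta_img = (1.0 - alpha - gamma) / k_img
    beta_txt = (1.0 - alpha - gamma) / k_txt

    q = [[0.0] * size for _ in range(size)]
    for i in range(mask):
        if i < k_img:
            lo, hi, beta = 0, k_img, beta_img
        else:
            lo, hi, beta = k_img, mask, beta_txt
        for j in range(lo, hi):
            q[i][j] = beta        # uniform resampling inside the modality
        q[i][i] += alpha          # extra mass for staying unchanged
        q[i][mask] = gamma        # transition to [MASK]
    q[mask][mask] = 1.0           # [MASK] is absorbing
    return q


def corrupt_step(tokens, q, rng):
    """Apply one forward (noising) step independently to each token."""
    states = range(len(q))
    return [rng.choices(states, weights=q[t], k=1)[0] for t in tokens]


# Toy example: 4 image tokens (0-3), 3 text tokens (4-6), [MASK] = 7.
q = unified_transition_matrix(4, 3, alpha=0.7, gamma=0.1)
rng = random.Random(0)
noisy = corrupt_step([0, 1, 2, 3, 4, 5, 6], q, rng)
```

Under these assumptions the cross-modal blocks of the matrix are zero, so an image token can only become another image token or [MASK], and likewise for text, while both modalities share the same absorbing state.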

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Generative models

TL;DR: We propose a Unified Discrete Denoising Diffusion model that constructs a joint vision-language probability distribution, enabling a single model to generate cross-domain results simultaneously.

Supplementary Material: zip

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/unified-discrete-diffusion-for-simultaneous/code)
