Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Runpeng Yu; Xinyin Ma; Xinchao Wang

14 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Diffusion Large Multimodal Model

Abstract: In this work, we present Dimple and Dimple+, two Discrete Diffusion Multimodal Large Language Models (dMLLMs). Dimple is initialized from a discrete diffusion Large Language Model (dLLM) that has no multimodal understanding ability, and acquires that ability through a hybrid training paradigm that first applies autoregressive training and then switches to discrete diffusion training. Dimple+ is initialized from an autoregressive Multimodal Large Language Model and acquires parallel decoding capability through pure discrete diffusion training. Both models achieve performance comparable to their autoregressive baselines, and Dimple+ establishes new state-of-the-art results among dMLLMs. To improve inference efficiency, we propose Confident Decoding, which dynamically adjusts the number of tokens generated per iteration. Experiments show that it accelerates decoding by 2×–6× with only minor performance degradation. We also demonstrate that the Prefilling technique, previously used in autoregressive models, can be applied effectively to dMLLMs with bidirectional attention, achieving nearly lossless speedups of 1.7×–7×. Finally, we introduce the Structure Prior method, which enables fine-grained control over response format and reasoning structure that is difficult to realize in autoregressive models.
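The abstract says only that Confident Decoding varies the number of tokens generated per iteration; it does not give the rule. The sketch below is one plausible reading, not the authors' implementation: it assumes a fixed confidence threshold, and commits every masked position whose top probability reaches that threshold (falling back to the single most confident position so that each iteration makes progress). The names `confident_decode`, `toy_predict`, `threshold`, and the toy predictor's behavior are all hypothetical and exist only for illustration.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical marker for a position that is not decoded yet


def confident_decode(
    predict: Callable[[List[Optional[int]]], List[List[float]]],
    length: int,
    threshold: float = 0.9,
) -> Tuple[List[Optional[int]], int]:
    """Fill `length` masked positions; return the sequence and iteration count.

    ASSUMPTION: a fixed confidence threshold decides which positions are
    committed in each iteration. The abstract does not specify the rule.
    """
    seq: List[Optional[int]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        probs = predict(seq)  # one distribution per position
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Most likely token and its probability for each masked position.
        best = {i: max(range(len(probs[i])), key=probs[i].__getitem__) for i in masked}
        conf = {i: probs[i][best[i]] for i in masked}
        # Commit every position that is confident enough ...
        chosen = [i for i in masked if conf[i] >= threshold]
        # ... or the single most confident one, so the loop always advances.
        if not chosen:
            chosen = [max(masked, key=conf.__getitem__)]
        for i in chosen:
            seq[i] = best[i]
        steps += 1
    return seq, steps


# Toy stand-in for a model (hypothetical): a position is predicted with
# probability 0.95 if its index is even or its left neighbor is decoded,
# and with probability 0.5 otherwise.
TARGET = [3, 1, 2, 0]
VOCAB = 4


def toy_predict(seq: List[Optional[int]]) -> List[List[float]]:
    out = []
    for i in range(len(seq)):
        sure = i % 2 == 0 or seq[i - 1] is not MASK
        p = 0.95 if sure else 0.5
        row = [(1.0 - p) / (VOCAB - 1)] * VOCAB
        row[TARGET[i]] = p
        out.append(row)
    return out


tokens, steps = confident_decode(toy_predict, length=4, threshold=0.9)
print(tokens, steps)
```

With this toy predictor, a threshold of 0.9 commits two positions per iteration and finishes in 2 iterations, whereas a threshold of 0.99 is never met, so the fallback decodes one token at a time and needs 4. The threshold therefore trades decoding speed against how much confidence is required before a token is fixed.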

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 4970
