Keywords: Diffusion Large Multimodal Model
Abstract: In this work, we present Dimple and Dimple+, two Discrete Diffusion Multimodal Large Language Models (dMLLMs). Dimple is initialized from a discrete diffusion Large Language Model (dLLM) without multimodal understanding ability, and learns such ability through a hybrid training paradigm that first applies autoregressive training and then switches to discrete diffusion training. Dimple+ is initialized from an autoregressive Multimodal Large Language Models, and acquires parallel decoding capability through pure discrete diffusion training. Both models achieve performance comparable to their autoregressive baselines, and Dimple+ establishes new state-of-the-art results among dMLLMs. To enhance inference efficiency, we propose Confident Decoding, which dynamically adjusts the number of tokens generated per iteration. Experiments show that it accelerates decoding by 2×–6× with only minor performance degradation. We also demonstrate that the Prefilling technique, previously used in autoregressive models, can be effectively applied to dMLLMs with bidirectional attention, achieving nearly lossless speedups of 1.7×–7×. Finally, we introduce the Structure Prior method, enabling fine-grained control over response format and reasoning structure, which is difficult to realize in autoregressive models.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4970
Loading