Keywords: discrete diffusion, masked diffusion, math reasoning, image generation, reinforcement learning, GRPO
TL;DR: We introduce **MaskGRPO**, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations.
Abstract: Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, confounding reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce **MaskGRPO**, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end,
we first clarify the theoretical foundation of DDMs, which facilitates building an importance estimator that captures informative token fluctuations for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. Across math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, **doubling** reinforcement learning gains while speeding up training by up to **30%**. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way to apply GRPO-style training to discretized visual diffusion. The code is available at https://github.com/martian422/MaskGRPO.
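For readers unfamiliar with GRPO, the generic recipe the abstract builds on can be sketched as follows: rewards are standardized within each rollout group to form advantages, and a per-token importance ratio gates the policy-gradient update via a clipped surrogate. This is a minimal, hypothetical illustration of the standard autoregressive-style objective, not the paper's method; MaskGRPO's contribution is precisely that per-token log-probabilities (`logp_new`, `logp_old` below) are not directly tractable in discrete diffusion and must be replaced by the importance estimator the paper constructs.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Clipped importance-weighted objective (PPO/GRPO style) for one token.

    In an AR model logp_* are exact per-token log-probabilities; in a DDM
    they would come from an importance estimator such as the one MaskGRPO builds.
    """
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# Toy group of 4 completions scored by a reward model.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Standardizing within the group removes the need for a learned value baseline, which is the key simplification GRPO makes over PPO.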
Primary Area: generative models
Submission Number: 1638