Keywords: Diffusion Large Language Models, Reinforcement Learning, Inpainting, Group Relative Policy Optimization
TL;DR: IGPO, an RL method for diffusion LLMs that uses inpainting to inject partial reasoning hints when all sampled responses are wrong, achieving SoTA results on math benchmarks among masked diffusion LLMs
Abstract: Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs by addressing a key challenge: sparse reward signals and wasted samples when models fail to discover correct solutions. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically injects partial ground-truth reasoning traces during online sampling to guide exploration toward promising trajectory spaces while preserving self-generated reasoning. Applied to group-based optimization methods such as GRPO, IGPO restores meaningful gradients when exploration failures cause all advantages to collapse to zero. Combined with supervised fine-tuning on synthetically rewritten concise traces and entropy-based filtering, our approach achieves state-of-the-art performance on four mathematical benchmarks among full-attention-based masked dLLMs.
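To make the core mechanism concrete, here is a minimal Python sketch of inpainting-guided group sampling under stated assumptions: `sample_fn`, `reward_fn`, `hint_fraction`, and the resampling scheme are hypothetical placeholders, not the authors' implementation, and the sketch only illustrates how injecting a partial ground-truth trace can restore non-zero group-relative advantages when every sampled response is wrong.

```python
import random

def group_relative_advantages(rewards):
    """GRPO-style advantages: reward minus the group mean.
    If all rewards are identical (e.g. all wrong), every advantage is zero
    and the policy gradient for that prompt vanishes."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def inpainting_guided_group(prompt, gt_trace, sample_fn, reward_fn,
                            group_size=8, hint_fraction=0.3):
    """Hypothetical sketch (not the paper's code):
    1. Sample a group of responses from the current policy.
    2. If the group is degenerate (all rewards equal, e.g. all wrong),
       resample part of the group with a fragment of the ground-truth
       reasoning trace injected as an inpainting hint, so some responses
       earn reward and the group-relative gradient is restored."""
    responses = [sample_fn(prompt, hint=None) for _ in range(group_size)]
    rewards = [reward_fn(prompt, r) for r in responses]

    if max(rewards) == min(rewards):  # zero-advantage group
        hint = gt_trace[: int(len(gt_trace) * hint_fraction)]  # partial hint
        for i in random.sample(range(group_size), group_size // 2):
            responses[i] = sample_fn(prompt, hint=hint)  # inpainting-guided sample
            rewards[i] = reward_fn(prompt, responses[i])

    return responses, rewards, group_relative_advantages(rewards)
```

In this sketch, `sample_fn(prompt, hint=...)` stands in for the dLLM's inpainting-capable sampler (the hint tokens are fixed while the rest of the sequence is denoised), and the hinted responses remain partly self-generated, matching the abstract's goal of preserving self-generated reasoning while steering exploration.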
Submission Number: 172