Keywords: Masked Diffusion Models, Diffusion Language Models, Direct Preference Optimization, Variance Reduction
TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.
Abstract: Diffusion large language models offer a promising paradigm for language modeling, yet their alignment remains underexplored, particularly in terms of systematic theoretical analysis and comprehensive empirical validation on general tasks. In this paper, we identify a primary challenge for this problem: the high variance of the Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose *Variance-Reduced Preference Optimization* (VRPO), a framework built on Direct Preference Optimization (DPO) that formally analyzes the bias and variance of the preference optimization loss and gradient, showing that both are governed by the variance of the score estimator. Building on this analysis, we introduce several unbiased variance reduction strategies, including optimal budget allocation and antithetic sampling, to improve alignment performance. We demonstrate the effectiveness of VRPO by applying it to LLaDA, a large-scale diffusion language model. The resulting model, LLaDA 1.5, consistently outperforms its SFT-only predecessor across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment (IFEval +4.0, Arena-Hard +4.3) benchmarks. Furthermore, LLaDA 1.5 achieves highly competitive mathematical performance compared with other strong language MDMs and ARMs.
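The sketch below is a minimal, hypothetical illustration of the kind of variance reduction the abstract describes, not the authors' implementation. It assumes a toy masked diffusion model interface `model(response, mask)` returning per-token log-probabilities, and the helper names (`elbo_draw`, `estimate_elbo`, `vrpo_dpo_loss`) are illustrative: each ELBO-based log-likelihood in the DPO loss is estimated by Monte Carlo over mask ratios, with one mask per sampled ratio (the budget-allocation choice the abstract alludes to), and the same (ratio, mask) draws are shared between the policy and the reference term so that correlated sampling noise cancels in the preference score.

```python
import torch
import torch.nn.functional as F

def elbo_draw(model, response, t, mask):
    """One Monte Carlo draw of the ELBO integrand for a masked diffusion model:
    score the masked response tokens under `model` and weight by 1/t
    (a toy stand-in for the real masked-token cross-entropy term)."""
    log_probs = model(response, mask)                        # (seq_len, vocab)
    token_logp = log_probs.gather(-1, response.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum() / t

def estimate_elbo(model, response, draws):
    """Average ELBO draws over a fixed, shared set of (t, mask) samples."""
    return torch.stack([elbo_draw(model, response, t, m) for t, m in draws]).mean()

def vrpo_dpo_loss(policy, ref, chosen, rejected, beta=0.1, n_samples=8):
    """DPO loss on ELBO-based likelihood estimates, reusing the same (t, mask)
    draws for policy and reference so shared noise cancels in the score."""
    def sample_draws(response):
        draws = []
        for _ in range(n_samples):                 # one mask per sampled ratio
            t = torch.rand(()).clamp_min(1e-3)
            mask = (torch.rand(response.shape) < t).float()
            draws.append((t, mask))
        return draws

    draws_w, draws_l = sample_draws(chosen), sample_draws(rejected)
    score = beta * (
        (estimate_elbo(policy, chosen, draws_w) - estimate_elbo(ref, chosen, draws_w))
        - (estimate_elbo(policy, rejected, draws_l) - estimate_elbo(ref, rejected, draws_l))
    )
    return -F.logsigmoid(score)
```

Under these assumptions, sharing the draws leaves each individual ELBO estimate unbiased while making the policy and reference estimates positively correlated, so their difference, and hence the preference score fed into the DPO loss, has lower variance than with independent sampling.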
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5193