Keywords: Masked Diffusion Models, Diffusion Language Models, Direct Preference Optimization, Variance Reduction
TL;DR: We propose VRPO to reduce gradient variance and improve preference alignment in masked diffusion language models.
Abstract: Diffusion large language models offer a promising paradigm for language modeling, yet their alignment remains underexplored, particularly in terms of systematic theoretical analysis and comprehensive empirical validation on general tasks. In this paper, we identify a primary challenge for this problem: the high variance of the Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. To address this issue, we propose *Variance-Reduced Preference Optimization* (VRPO), a framework built on Direct Preference Optimization (DPO) that formally analyzes the bias and variance of the preference optimization loss and gradient, showing that both are governed by the variance of the score estimator. Building on this analysis, we introduce several unbiased variance reduction strategies, including optimal budget allocation and antithetic sampling, to improve alignment performance. We demonstrate the effectiveness of VRPO by applying it to LLaDA, a large-scale diffusion language model. The resulting model, LLaDA 1.5, consistently outperforms its SFT-only predecessor across mathematical (GSM8K +4.7), code (HumanEval +3.0, MBPP +1.8), and alignment (IFEval +4.0, Arena-Hard +4.3) benchmarks. Furthermore, LLaDA 1.5 achieves highly competitive mathematical performance compared with other strong language MDMs and ARMs.
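The sketch below is a minimal, hypothetical illustration of the kind of variance reduction the abstract describes, not the authors' implementation. It assumes a toy masked diffusion model interface `model(response, mask)` returning per-token log-probabilities, and the helper names (`elbo_draw`, `estimate_elbo`, `vrpo_dpo_loss`) are illustrative: each ELBO-based log-likelihood in the DPO loss is estimated by Monte Carlo over mask ratios, with one mask per sampled ratio (the budget-allocation choice the abstract alludes to), and the same (ratio, mask) draws are shared between the policy and the reference term so that correlated sampling noise cancels in the preference score.

```python
import torch
import torch.nn.functional as F

def elbo_draw(model, response, t, mask):
    """One Monte Carlo draw of the ELBO integrand for a masked diffusion model:
    score the masked response tokens under `model` and weight by 1/t
    (a toy stand-in for the real masked-token cross-entropy term)."""
    log_probs = model(response, mask)                        # (seq_len, vocab)
    token_logp = log_probs.gather(-1, response.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum() / t

def estimate_elbo(model, response, draws):
    """Average ELBO draws over a fixed, shared set of (t, mask) samples."""
    return torch.stack([elbo_draw(model, response, t, m) for t, m in draws]).mean()

def vrpo_dpo_loss(policy, ref, chosen, rejected, beta=0.1, n_samples=8):
    """DPO loss on ELBO-based likelihood estimates, reusing the same (t, mask)
    draws for policy and reference so shared noise cancels in the score."""
    def sample_draws(response):
        draws = []
        for _ in range(n_samples):                 # one mask per sampled ratio
            t = torch.rand(()).clamp_min(1e-3)
            mask = (torch.rand(response.shape) < t).float()
            draws.append((t, mask))
        return draws

    draws_w, draws_l = sample_draws(chosen), sample_draws(rejected)
    score = beta * (
        (estimate_elbo(policy, chosen, draws_w) - estimate_elbo(ref, chosen, draws_w))
        - (estimate_elbo(policy, rejected, draws_l) - estimate_elbo(ref, rejected, draws_l))
    )
    return -F.logsigmoid(score)
```

Under these assumptions, sharing the draws leaves each individual ELBO estimate unbiased while making the policy and reference estimates positively correlated, so their difference, and hence the preference score fed into the DPO loss, has lower variance than with independent sampling.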
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5193