Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Discrete Diffusion Models, Policy gradient algorithms, Non-differentiable rewards, Fine-Tuning, Reinforcement Learning from Human Feedback
TL;DR: We propose a policy gradient algorithm for fine-tuning discrete diffusion models over non-differentiable rewards.
Abstract: Discrete diffusion models have recently gained significant attention due to their ability to model complex discrete structures, such as those arising in language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains challenging. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method.
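The abstract describes fine-tuning against a reward that cannot be backpropagated through. As a rough illustration of that setting only, and not of the SEPO update itself (which is defined in the paper), the sketch below shows a plain REINFORCE-style policy gradient step against a black-box reward; `model.sample`, `model.sequence_log_prob`, and `reward_fn` are assumed, hypothetical interfaces.

```python
# Illustrative sketch: a generic REINFORCE-style update with a
# non-differentiable (black-box) reward. This is NOT the SEPO algorithm;
# `model.sample`, `model.sequence_log_prob`, and `reward_fn` are
# hypothetical interfaces assumed for this example.
import torch

def policy_gradient_step(model, reward_fn, optimizer, batch_size=32):
    # Draw discrete sequences from the current model (no gradients needed here).
    with torch.no_grad():
        sequences = model.sample(batch_size)                 # (B, L) token ids
        rewards = torch.tensor([reward_fn(s) for s in sequences],
                               dtype=torch.float32)          # black-box scores

    # Baseline-subtracted rewards reduce gradient variance.
    advantages = rewards - rewards.mean()

    # Log-probability of each sampled sequence under the current model.
    log_probs = model.sequence_log_prob(sequences)           # (B,)

    # REINFORCE objective: maximize E[advantage * log pi(sequence)].
    loss = -(advantages.detach() * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```

Because the reward enters only as a scalar weight on the log-probabilities, it never needs to be differentiable, which is the regime the paper targets.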
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 7660