Understanding Sampler Stochasticity in Training Diffusion Models for RLHF

ICLR 2026 Conference Submission 20325 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Generative Diffusion Models, Fine-tuning, Reinforcement Learning from Human Feedback, Deterministic Sampling, Human Preference Alignment
TL;DR: High-stochasticity SDE training in diffusion RLHF also benefits deterministic ODE sampling
Abstract: Reinforcement Learning from Human Feedback (RLHF) improves pretrained generative models, and the choice of sampler during training is important for obtaining reliable, high-quality models. In practice, stochastic SDE samplers promote exploration during training, while deterministic ODE samplers enable fast, stable inference; this discrepancy in sampling stochasticity induces a preference-reward gap. In this paper, we establish a non-vacuous bound on this gap for general diffusion models and a sharper bound for Variance Exploding (VE) and Variance Preserving (VP) models with (mixture) Gaussian data. Methodologically, we leverage the stochastic gDDIM scheme to attain arbitrarily high stochasticity while preserving data marginals, and we evaluate, under multiple preference rewards, the performance of RL algorithms (e.g., log-likelihood and group-relative policy variants). Our numerical experiments show that reward gaps consistently narrow over training, and that ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
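To make the role of sampler stochasticity concrete, below is a minimal sketch of a DDIM-family reverse step in which a single scalar eta interpolates between the deterministic ODE sampler (eta = 0) and increasingly stochastic SDE-like samplers (eta >= 1). It assumes a VP diffusion with noise-prediction parameterization; the function name `ddim_step` and its arguments are illustrative and not the paper's exact gDDIM implementation.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta, rng):
    """One DDIM-style reverse step with tunable stochasticity.

    eta = 0 recovers the deterministic (ODE-like) DDIM update;
    eta = 1 matches the ancestral DDPM variance; larger eta injects
    more noise, in the spirit of the higher-stochasticity sampling
    discussed in the abstract.
    """
    # Predicted clean sample from the noise prediction (VP parameterization).
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)

    # Stochasticity-controlled step standard deviation (DDIM's sigma_t).
    sigma = eta * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t)) \
                * np.sqrt(1.0 - abar_t / abar_prev)
    # Clamp so the deterministic "direction" coefficient stays real;
    # the gDDIM construction handles very large eta more carefully.
    sigma = min(sigma, np.sqrt(1.0 - abar_prev))

    # Deterministic direction toward x_t, plus injected Gaussian noise.
    dir_xt = np.sqrt(1.0 - abar_prev - sigma**2) * eps_pred
    noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(abar_prev) * x0_pred + dir_xt + noise
```

In this sketch, training-time rollouts would use eta > 0 for exploration, while inference would call the same step with eta = 0, which is exactly the train/inference stochasticity mismatch whose reward gap the paper bounds.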
Primary Area: reinforcement learning
Submission Number: 20325