Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Diffusion Model, Reward Model, Preference Optimization, Image Generation
TL;DR: Leveraging the pre-trained diffusion model as a powerful and cost-effective step-level reward model to optimize the diffusion model itself directly in the noisy latent space.
Abstract: Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models struggle to handle noisy images at different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited to step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the **Latent Reward Model (LRM)**, which repurposes components of the diffusion model to predict the preferences of latent images at arbitrary timesteps. Building on the LRM, we introduce **Latent Preference Optimization (LPO)**, a step-level preference optimization method conducted directly in the noisy latent space. Experimental results show that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods.
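
Since the abstract only describes the objective at a high level, the following is a minimal, hypothetical sketch of what a step-level preference loss in the noisy latent space could look like, assuming a Diffusion-DPO-style formulation; the function `lpo_step_loss` and all argument names are illustrative assumptions, not the paper's actual implementation. In this sketch, the LRM would be applied beforehand to two candidate latents at a sampled timestep to decide which is the winner and which the loser.

```python
import torch
import torch.nn.functional as F

def lpo_step_loss(policy_eps_w, policy_eps_l,
                  ref_eps_w, ref_eps_l,
                  eps_w, eps_l, beta=5.0):
    """DPO-style step-level loss (hypothetical sketch).

    All tensors have shape (B, C, H, W): noise predictions from the policy
    and frozen reference model on the LRM-preferred ("winner", _w) and
    LRM-dispreferred ("loser", _l) latents, plus the target noises eps_*.
    """
    # Per-sample squared error serves as a proxy for the negative
    # step-level log-likelihood of each latent under each model.
    def err(pred, tgt):
        return ((pred - tgt) ** 2).flatten(1).mean(dim=1)

    # How much better the policy explains the winner than the reference does,
    # minus the same quantity for the loser.
    diff = (err(ref_eps_w, eps_w) - err(policy_eps_w, eps_w)) \
         - (err(ref_eps_l, eps_l) - err(policy_eps_l, eps_l))

    # Bradley-Terry-style logistic loss over the advantage gap.
    return -F.logsigmoid(beta * diff).mean()
```

In a training loop, this loss would be averaged over sampled timesteps, with the winner/loser labels supplied by the LRM's scores on the two noisy latents; the exact objective and weighting used by LPO are described in the paper itself.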
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 6653