Keywords: Visual Generative Models, Diffusion Models, Reinforcement Learning, Distribution-wise Rewards
TL;DR: We propose an efficient method to train visual generative models using reinforcement learning with distribution-wise rewards (instead of sample-wise), which reduces artifacts, improves diversity, and achieves better FID scores.
Abstract: Reinforcement learning (RL) for visual generative models often relies on sample-wise reward functions, which can incentivize reward hacking, leading to visual artifacts and reduced diversity. In this work, we propose a novel approach that uses distribution-wise rewards to guide visual generative models toward learning the real-world image distribution more accurately. Unlike rewards that evaluate samples individually, distribution-wise rewards account for the overall distribution of the generated samples, mitigating the mode collapse that arises when every sample is optimized independently in the same direction. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that provides reward signals efficiently by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) sampling in standard RL practice. Extensive experiments show that our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation further confirms that our method enhances perceptual quality while preserving sample diversity.
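To make the subset-replace idea concrete, here is a minimal sketch (hypothetical, not the authors' code) of how a distribution-wise reward could be computed cheaply: a cached reference set of generated features is maintained, a small subset is swapped for newly generated samples, and the change in an FID-style distance to fixed real-data statistics serves as the reward. The class name `SubsetReplaceReward` and all function signatures are illustrative assumptions.

```python
# Sketch of a subset-replace distribution-wise reward (assumed interface).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians fitted to two feature sets (FID-style)."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

class SubsetReplaceReward:
    def __init__(self, real_feats, init_gen_feats):
        # Real-data statistics are precomputed once and held fixed.
        self.mu_r = real_feats.mean(axis=0)
        self.sigma_r = np.cov(real_feats, rowvar=False)
        # Cached reference set of generated features (updated in place).
        self.ref = init_gen_feats.copy()

    def __call__(self, new_feats):
        """Replace a random subset of the reference set with the new batch and
        return the negated FID-style distance as a distribution-wise reward."""
        idx = np.random.choice(len(self.ref), size=len(new_feats), replace=False)
        self.ref[idx] = new_feats
        mu_g = self.ref.mean(axis=0)
        sigma_g = np.cov(self.ref, rowvar=False)
        return -frechet_distance(mu_g, sigma_g, self.mu_r, self.sigma_r)
```

In this simplified version a single scalar reward is shared by the newly inserted batch; how credit is assigned to individual samples, and how the reference set is refreshed over training, would follow the paper's actual algorithm.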
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 2110