Abstract: We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modal encoders such as CLAP are used, the reference may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them with the reward function to construct a positive demonstration set and a negative set, and leverages the contrast between the two finite sets to approximate distributional reward optimization. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Example generations can be found at https://ml-dragon.github.io/web/.
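For intuition, the demonstration-set construction described above can be sketched as follows for an instance-wise reward. This is only an illustrative sketch: the helper names (`generate_batch`, `reward_fn`, `contrastive_update`) are hypothetical placeholders rather than DRAGON's actual interface, and the median split is just one possible scoring rule.

```python
# Minimal sketch of demonstration-set construction for an instance-wise reward.
# All names (generate_batch, reward_fn, contrastive_update) are hypothetical
# placeholders and do not correspond to DRAGON's released code.
import numpy as np

def build_demo_sets(generations, reward_fn):
    """Score on-policy generations and split them into positive / negative sets."""
    scores = np.array([reward_fn(x) for x in generations])
    threshold = np.median(scores)  # e.g., a median split; other rules are possible
    positive = [x for x, s in zip(generations, scores) if s >= threshold]
    negative = [x for x, s in zip(generations, scores) if s < threshold]
    return positive, negative

# Schematic fine-tuning step: sample online generations, build the two sets,
# then push the model toward the positive set and away from the negative set.
# generations = generate_batch(model, prompts)
# positive, negative = build_demo_sets(generations, reward_fn)
# loss = contrastive_update(model, positive, negative)
```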
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Dear Esteemed Action Editor and Reviewers,
Thank you for your continued thoughtful feedback and comments. We are thrilled with the recommendation to accept. In addition to our rebuttal submission, which includes both commentary and an updated draft with highlighted changes, we have continued to make minor updates that further address the requested revisions. Our changes explicitly address the feedback on notation and equations, Algorithm 1 clarity, presentation polish, and reproducibility.
- For notation and equations, we now state in the abstract, introduction, and body that distributional rewards are computed on finite sampled sets. We still formulate the problem on continuous distributions in Section 3.0 and then approximate it with finite sets in Section 3.1, and we have added clarifying remarks in Section 3.1 to make this transition explicit. We have also removed the "recover" phrasing and updated the prose in Section 3.2 to clearly state our surrogate objectives.
- For Algorithm 1 clarity, the iteration and indexing variables are indeed the same, and we have clarified this in both the algorithm and the prose. We have also further explained how the procedure differs from the simpler instance-wise reward case, and added a sentence explaining why greedy swaps are necessary for non-decomposable (distribution-level) rewards (see the illustrative sketch after this list).
- For presentation polish, we have made sure all figures have full captions, fixed Figure 1, and made the prior work, gap, and contribution explicit in the second paragraph of the introduction.
- For reproducibility, we have confirmed that implementation details such as dataset descriptions, batch sizes, and compute budgets are explicitly included. We have also clarified that we use identical DRAGON hyperparameters for every reward type and explained how we chose them. To further facilitate reproducibility, we have included training-loop pseudocode in Appendix F, where the `compare_r_dist` function forms the core of distribution-level reward optimization (a simplified sketch is shown after this list).
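For intuition on why greedy swaps are necessary, we include a simplified illustration below. Because a non-decomposable reward `r_dist` is defined on a whole set rather than summed over its elements, the positive set cannot be chosen by thresholding per-example scores; instead, elements are swapped across the two sets only when the swap improves the set-level reward. This is a schematic reconstruction for intuition, not the verbatim Algorithm 1, and `r_dist` and the initial split are assumed to be given.

```python
# Illustrative greedy-swap procedure for a non-decomposable (set-level) reward.
# This is a simplified reconstruction for intuition, not the verbatim Algorithm 1.

def greedy_swap(positive, negative, r_dist, max_passes=3):
    """Swap elements across the two sets while it improves r_dist(positive)."""
    best = r_dist(positive)
    for _ in range(max_passes):
        improved = False
        for i in range(len(positive)):
            for j in range(len(negative)):
                # Tentatively place negative[j] at position i of the positive set.
                candidate = positive[:i] + [negative[j]] + positive[i + 1:]
                score = r_dist(candidate)
                if score > best:  # accept only improving swaps
                    positive[i], negative[j] = negative[j], positive[i]
                    best = score
                    improved = True
        if not improved:  # stop once no single swap helps
            break
    return positive, negative
```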
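We also sketch below the set-level comparison at the heart of the training loop. The exact pseudocode is in Appendix F; the signature of `compare_r_dist` shown here, and the commented helper names in the usage example, are illustrative assumptions rather than our actual implementation.

```python
# Condensed sketch of the distribution-level comparison in the training loop.
# The signature of compare_r_dist here is an illustrative assumption; the exact
# pseudocode is in Appendix F.

def compare_r_dist(set_a, set_b, r_dist):
    """Return +1 if set_a attains a higher distributional reward than set_b, else -1.

    r_dist maps a finite set of generations to a scalar (e.g., negative FAD or
    Vendi diversity against an exemplar reference), so the comparison is made at
    the set level rather than per example.
    """
    return 1 if r_dist(set_a) > r_dist(set_b) else -1

# Schematic use inside one training step (helper names are placeholders):
# generations = sample_on_policy(model, prompts)
# positive, negative = split_into_sets(generations)  # e.g., via greedy swaps
# assert compare_r_dist(positive, negative, r_dist) == 1
# loss = contrastive_update(model, positive, negative)
```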
Assigned Action Editor: ~Jiang_Bian1
Submission Number: 4707