Proximal Preference Optimization for Diffusion Models

23 Sept 2023 (modified: 25 Mar 2024) ICLR 2024 Conference Withdrawn Submission
Keywords: Diffusion model, Reinforcement learning, Preference optimization, RLHF
Abstract: Preference optimization techniques such as Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) have emerged as the new standard approach for fine-tuning foundation models. Preference learning, however, is typically optimized in a reinforcement learning setting, which leads to high variance, low data efficiency, and slow convergence. The recent Direct Preference Optimization method proved effective at mitigating these issues for language models by converting preference learning into a supervised learning paradigm. However, little has been studied for image generative models such as diffusion models. In this paper, we propose Proximal Preference Optimization for Diffusion models (PPOD), which extends the prior work with proximal constraints to address the optimization challenges in diffusion models. We further study online versus offline evaluation as well as the choice of optimization objective, and identify the optimal setting for different use cases. This method makes preference optimization stable and feasible under the supervised learning setting. Our evaluation shows that PPOD outperforms other RL-based reward optimization approaches on the Stable Diffusion model. To the best of our knowledge, this is the first work to enable efficient RLAIF optimization for diffusion models.
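The abstract describes a DPO-style supervised objective with an added proximal constraint. Since the paper text is not available here, the sketch below is only an illustration of that general idea, not the authors' method: a Bradley-Terry preference loss over the implicit log-ratios of a current and a frozen reference diffusion model (approximated, as in common practice, by denoising errors), with a hypothetical PPO-style clipping term standing in for the proximal constraint. All function names and the clipping choice are assumptions.

```python
import math


def ppod_style_loss(err_w, err_l, ref_err_w, ref_err_l,
                    beta=0.1, clip_eps=0.2):
    """Illustrative DPO-like preference loss for diffusion models.

    err_w / err_l:       denoising errors of the current model on the
                         preferred (winner) and rejected (loser) samples.
    ref_err_w / ref_err_l: the same errors under a frozen reference model.
    beta:                temperature of the implicit reward (assumption).
    clip_eps:            PPO-style clipping range standing in for the
                         proximal constraint (assumption).
    """
    # Lower denoising error acts as a proxy for higher model likelihood,
    # so the implicit "advantage" over the reference is the negated gap.
    adv_w = -(err_w - ref_err_w)
    adv_l = -(err_l - ref_err_l)

    # Preference margin between winner and loser, scaled by beta.
    margin = beta * (adv_w - adv_l)

    # Proximal constraint sketched as clipping, then taking the
    # pessimistic (smaller) margin, analogous to PPO's clipped objective.
    clipped = max(min(margin, clip_eps), -clip_eps)
    logit = min(margin, clipped)

    # Bradley-Terry negative log-likelihood of preferring the winner.
    return -math.log(1.0 / (1.0 + math.exp(-logit)))
```

In this toy form, the loss falls below log 2 when the current model reduces the winner's denoising error more than the loser's relative to the reference, which is the supervised-learning reformulation the abstract attributes to DPO-style training.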
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8032