Keywords: one-step text-to-image generative model, human preference alignment, RLHF
TL;DR: We introduce a state-of-the-art one-step text-to-image generative model that is aligned with human preferences.
Abstract: In this paper, we introduce Diff-Instruct* (DI*), an image-data-free approach for building one-step text-to-image generative models that align with human preference while maintaining the ability to generate highly realistic images. We frame human preference alignment as online reinforcement learning from human feedback (RLHF), where the goal is to maximize a reward function while regularizing the generator distribution to remain close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization that leads to significantly better performance. Although direct calculation of this divergence remains intractable, we show that its gradient can be computed efficiently by deriving an equivalent yet tractable loss function. Remarkably, with Stable Diffusion V1.5 as the reference diffusion model, DI* outperforms all previously leading models by a large margin. With the 2.6B Stable Diffusion XL architecture, DI* yields a strong human-preferred one-step model that generates aesthetic images at $1024\times 1024$ resolution. With the 0.6B PixelArt-α model as the reference diffusion, DI* achieves a new record Aesthetic Score of 6.30 and an Image Reward of 1.31 with only a single generation step, nearly doubling the scores of other models of similar size. It also achieves an HPSv2 score of 28.70, establishing a new state-of-the-art benchmark. We further observe that DI* improves the layout and enriches the colors of generated images. Our best human-preferred one-step generator will be released with this paper.
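For concreteness, the alignment objective described in the abstract is a regularized reward maximization; the following is a minimal sketch in our own notation (not taken from the paper), with $\theta$ the generator parameters, $p_\theta$ the one-step generator distribution, $r(\cdot)$ the human-preference reward, $p_{\mathrm{ref}}$ the reference diffusion distribution, $\alpha > 0$ a regularization weight, and $D(\cdot\,\|\,\cdot)$ a divergence:
$$
\max_{\theta}\; \mathbb{E}_{x \sim p_\theta}\!\left[ r(x) \right] \;-\; \alpha\, D\!\left( p_\theta \,\|\, p_{\mathrm{ref}} \right),
$$
where standard RLHF takes $D$ to be the KL divergence, whereas DI* replaces it with a score-based divergence whose gradient with respect to $\theta$ is computed via an equivalent tractable loss.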
Supplementary Material: pdf
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8677