David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce a general score-based RLHF method and a state-of-the-art one-step text-to-image generative model.
Abstract: We propose Diff-Instruct* (DI*), a data-efficient post-training approach that improves the human-preference alignment of one-step text-to-image generative models without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), optimizing a human reward function while regularizing the generator to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on KL divergence for regularization, we introduce a novel score-based divergence regularization that substantially improves performance. Although this score-based RLHF objective appears intractable to optimize directly, we derive a strictly equivalent, tractable loss function whose gradient can be computed efficiently. Building on this framework, we train DI*-SDXL-1step, a one-step text-to-image model based on Stable Diffusion-XL (2.6B parameters) that generates 1024x1024 images in a single step. The 2.6B DI*-SDXL-1step model outperforms the 12B FLUX-dev model on ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result strongly supports the view that, with proper post-training, a small one-step model can beat much larger multi-step models. We will open-source our industry-ready model to the community.
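The core objective described above, maximizing a reward while penalizing a score-based divergence to a reference distribution, can be illustrated with a minimal 1-D toy. Everything here is an illustrative assumption, not the DI* training recipe: the generator is a Gaussian `x = mu + sigma * z`, the reference is a standard normal (score `s_ref(x) = -x`), the reward is `r(x) = x`, and the score divergence is the Fisher divergence, which has a closed form in this toy case.

```python
def train(alpha=0.5, lr=0.05, steps=2000):
    """Gradient descent on L = -E[r(x)] + alpha * score_divergence.

    Toy closed forms (Gaussian generator N(mu, sigma^2) vs. reference N(0, 1)):
      E[r(x)] = mu                       (reward r(x) = x)
      E[(s_gen(x) - s_ref(x))^2] = mu^2 + (sigma - 1/sigma)^2
    """
    mu, sigma = 0.0, 2.0  # generator parameters: x = mu + sigma * z
    for _ in range(steps):
        # Analytic gradients of L for this toy case.
        grad_mu = -1.0 + 2.0 * alpha * mu
        grad_sigma = 2.0 * alpha * (sigma - 1.0 / sigma) * (1.0 + 1.0 / sigma**2)
        mu -= lr * grad_mu
        sigma -= lr * grad_sigma
    return mu, sigma

mu, sigma = train()
print(round(mu, 3), round(sigma, 3))  # -> 1.0 1.0
```

With `alpha = 0.5`, the reward pulls the mean to `mu = 1/(2*alpha) = 1`, while the score regularizer keeps the spread at the reference value `sigma = 1`: the same reward-versus-regularization trade-off the abstract describes, in a setting where the score divergence is tractable by construction.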
Lay Summary: We introduce **Diff-Instruct\* (DI\*)**, a theory-driven post-training preference-alignment method for one-step text-to-image models. With DI\*, we trained a 2.6B one-step text-to-image generative model at 1024x1024 resolution that outperforms the 12B 50-step FLUX-dev model by significant margins. Our contribution improves both the efficiency and the human-preference alignment of text-to-image and video models.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/pkulwj1994/diff_instruct_star
Primary Area: Applications->Computer Vision
Keywords: One-step Text-to-image Generative Models, Reinforcement Learning, Human Feedback, Diffusion Models
Submission Number: 7929