everyone
since 04 Oct 2024">EveryoneRevisionsBibTeXCC BY 4.0
Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are winning tickets that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those winning noises. To learn winning noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the $\textit{noise prompt}$, which aims at turning a random Gaussian noise into a winning noise ticket by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the $\textit{noise prompt learning}$ framework that systematically learns "prompted'' winning noise tickets associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale $\textit{noise prompt dataset}$ (NPD) that contains 100k pairs of random noises and winning noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small $\textit{noise prompt network}$ (NPNet) that can directly learn to transform a random noise ticket into a winning noise ticket. The learned winning noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a winning noise instead of a random noise without accessing the original pipeline.