Taming Text Alignment for Personalization: Disentangling Foreground Customization and Background Style in Diffusion Models for Personalized Image Generation
Keywords: Image customization, Style transfer, Diffusion models, Personalized image generation
Abstract: Personalized image generation focuses on synthesizing text-driven images conditioned on reference images that specify the details of the generated content. Typically, it involves two key subtasks: *image customization* and *style transfer*. Previous works fail to align the generated image with the text prompt, manifesting as background misalignment in image customization and foreground misalignment in style transfer, because the dense image references attend only to either the foreground or the background and overwhelm the sparse text prompts. To tackle text alignment in personalized image generation, in this paper we propose a **D**ual-**R**eference **P**ersonalization diffusion model, dubbed **DRP-Diff**, for both image customization and style transfer. The crux lies in disentangling the foreground customization and the background style, so that each can be separately aligned with the text prompt during different stages of the denoising process. To align the text prompt with the customized reference focusing on the background, our *customized texture-disentangled matrix* concatenates the foreground textures of both the customized and style references as the key in cross-attention to reconstruct the query background of the denoised personalized image. To align the style reference focusing on the foreground with the text prompt, our *style-disentangled matrix* serves the background of the style reference as the key, together with its value, to reconstruct the query foreground of the denoised personalized image. These two processes are conducted in the early and late stages of the denoising process, respectively, guided by the text prompt. To adaptively modulate the boundary between the early and late stages, we compute the ratio of information entropy between the customized and style references. Extensive experiments validate the superiority of DRP-Diff over state-of-the-art diffusion models for personalized image generation. *Our code can be accessed from the supplementary material package.*
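The abstract describes two mechanisms: an entropy-ratio rule that splits the denoising schedule into an early (customization-aligned) and a late (style-aligned) stage, and disentangled cross-attention in which disjoint key/value sets reconstruct the background and foreground queries. Below is a minimal, hypothetical sketch of how such a scheme could look; it is not the authors' released code, and all function names and tensor layouts are assumptions made for illustration.

```python
# Hypothetical sketch of DRP-Diff's two mechanisms as described in the abstract;
# names, shapes, and the entropy estimator are illustrative assumptions.
import torch
import torch.nn.functional as F


def entropy(feat: torch.Tensor, num_bins: int = 256) -> torch.Tensor:
    """Shannon entropy of a reference (image or feature map), used as a proxy
    for how much information the reference carries."""
    hist = torch.histc(feat.float(), bins=num_bins)
    p = hist / hist.sum().clamp(min=1e-8)
    p = p[p > 0]
    return -(p * p.log()).sum()


def split_timestep(custom_ref: torch.Tensor, style_ref: torch.Tensor,
                   num_steps: int = 50) -> int:
    """Boundary between the early and late denoising stages, modulated by the
    ratio of information entropy between the two references."""
    e_c, e_s = entropy(custom_ref), entropy(style_ref)
    return int(num_steps * (e_c / (e_c + e_s)))


def disentangled_cross_attention(q_bg, q_fg,
                                 k_custom_fg, v_custom_fg,
                                 k_style_fg, v_style_fg,
                                 k_style_bg, v_style_bg):
    """Early stage: the foreground textures of both references serve as
    keys/values for the background queries. Late stage: the style reference's
    background serves as key/value for the foreground queries."""
    # Background queries attend to the concatenated foreground textures.
    k_early = torch.cat([k_custom_fg, k_style_fg], dim=1)
    v_early = torch.cat([v_custom_fg, v_style_fg], dim=1)
    bg_out = F.scaled_dot_product_attention(q_bg, k_early, v_early)
    # Foreground queries attend to the style reference's background.
    fg_out = F.scaled_dot_product_attention(q_fg, k_style_bg, v_style_bg)
    return bg_out, fg_out
```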
Supplementary Material: zip
Primary Area: generative models
Submission Number: 1182