Taming Text Alignment for Personalization: Disentangling Foreground Customization and Background Style in Diffusion Models for Personalized Image Generation
Keywords: Image customization, Style transfer, Diffusion models, Personalized image generation
Abstract: Personalized image generation focuses on synthesizing text-driven images conditioned on reference images that specify the details of the generated content. Typically, it involves two key subtasks: *image customization* and *style transfer*. Previous works fail to align the generated image with the text prompt, manifesting as background misalignment in image customization and foreground misalignment in style transfer, because the dense image references attend only to either the foreground or the background and overwhelm the sparse text prompts. To tackle text alignment in personalized image generation, in this paper we propose a **D**ual-**R**eference **P**ersonalization diffusion model, dubbed **DRP-Diff**, for both image customization and style transfer. The crux lies in disentangling the foreground customization and the background style, so that each can be separately aligned with the text prompt during different stages of the denoising process. To align the text prompt with the customized reference focusing on the background, our *customized texture-disentangled matrix* concatenates the foreground textures of both the customized and style references as the key in cross-attention to reconstruct the query background of the denoised personalized image. To align the style reference focusing on the foreground with the text prompt, our *style-disentangled matrix* serves the background of the style reference as the key, together with its value, to reconstruct the query foreground of the denoised personalized image. These two processes are conducted in the early and late stages of the denoising process, respectively, guided by the text prompt. To adaptively modulate the boundary between the early and late stages, we compute the ratio of information entropy between the customized and style references. Extensive experiments validate the superiority of DRP-Diff over state-of-the-art diffusion models for personalized image generation. *Our code can be accessed from the supplementary material package.*
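The abstract describes two mechanisms: an entropy-ratio rule that splits the denoising schedule into an early (customization-aligned) and a late (style-aligned) stage, and disentangled cross-attention in which disjoint key/value sets reconstruct the background and foreground queries. Below is a minimal, hypothetical sketch of how such a scheme could look; it is not the authors' released code, and all function names and tensor layouts are assumptions made for illustration.

```python
# Hypothetical sketch of DRP-Diff's two mechanisms as described in the abstract;
# names, shapes, and the entropy estimator are illustrative assumptions.
import torch
import torch.nn.functional as F


def entropy(feat: torch.Tensor, num_bins: int = 256) -> torch.Tensor:
    """Shannon entropy of a reference (image or feature map), used as a proxy
    for how much information the reference carries."""
    hist = torch.histc(feat.float(), bins=num_bins)
    p = hist / hist.sum().clamp(min=1e-8)
    p = p[p > 0]
    return -(p * p.log()).sum()


def split_timestep(custom_ref: torch.Tensor, style_ref: torch.Tensor,
                   num_steps: int = 50) -> int:
    """Boundary between the early and late denoising stages, modulated by the
    ratio of information entropy between the two references."""
    e_c, e_s = entropy(custom_ref), entropy(style_ref)
    return int(num_steps * (e_c / (e_c + e_s)))


def disentangled_cross_attention(q_bg, q_fg,
                                 k_custom_fg, v_custom_fg,
                                 k_style_fg, v_style_fg,
                                 k_style_bg, v_style_bg):
    """Early stage: the foreground textures of both references serve as
    keys/values for the background queries. Late stage: the style reference's
    background serves as key/value for the foreground queries."""
    # Background queries attend to the concatenated foreground textures.
    k_early = torch.cat([k_custom_fg, k_style_fg], dim=1)
    v_early = torch.cat([v_custom_fg, v_style_fg], dim=1)
    bg_out = F.scaled_dot_product_attention(q_bg, k_early, v_early)
    # Foreground queries attend to the style reference's background.
    fg_out = F.scaled_dot_product_attention(q_fg, k_style_bg, v_style_bg)
    return bg_out, fg_out
```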
Supplementary Material: zip
Primary Area: generative models
Submission Number: 1182