Abstract: Image customization involves learning a subject from provided concept images and generating it in new textual contexts, typically altering attributes such as style or background.
Prevailing methods primarily rely on fine-tuning techniques, wherein a single latent embedding is employed to characterize all concept attributes.
However, this attribute entanglement makes it difficult for customized results to escape the influence of subject-irrelevant attributes (e.g., style and background).
To overcome this issue, we propose Equilibrated Diffusion, an innovative method that achieves equilibrated image customization by decoupling entangled concept attributes from a frequency-aware perspective, thus harmonizing textual and visual consistency.
Unlike conventional approaches that employ a shared latent embedding and tuning process to learn the concept, Equilibrated Diffusion draws inspiration from the correlation of high- and low-frequency image components with style and content, and decomposes the concept accordingly in the frequency domain.
By independently optimizing concept embeddings in the frequency domain, the denoising model not only enriches its understanding of the style attribute, which is irrelevant to subject identity, but also becomes inherently better at accommodating novel stylized descriptions.
Furthermore, by combining the different frequency embeddings, our model retains the original customization capability in the spatial domain.
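As a rough illustration of this frequency-aware decoupling, the sketch below splits a concept image into low- and high-frequency components with an FFT low-pass mask and assigns each band its own learnable embedding; the cutoff ratio, embedding shape, and variable names are illustrative assumptions, not the paper's actual configuration.

```python
import torch

def frequency_decompose(image: torch.Tensor, cutoff: float = 0.1):
    """Split an NCHW image into low- and high-frequency components using an
    FFT low-pass mask (the cutoff ratio is an illustrative assumption)."""
    freq = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    _, _, h, w = image.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, h), torch.linspace(-0.5, 0.5, w), indexing="ij"
    )
    lowpass = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(freq.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * lowpass, dim=(-2, -1))).real
    high = image - low  # residual carries the high-frequency detail
    return low, high

# One learnable concept embedding per frequency band, optimized independently
# (hypothetical shapes matching a CLIP-style text-encoder token dimension).
low_freq_embed = torch.nn.Parameter(torch.randn(1, 768))
high_freq_embed = torch.nn.Parameter(torch.randn(1, 768))
optimizer = torch.optim.AdamW([low_freq_embed, high_freq_embed], lr=1e-4)
```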
We further design a diffusion process guided by subject masks to alleviate the influence of the background attribute, thereby strengthening text alignment.
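A minimal sketch of such mask-guided training, assuming the subject mask is available as a binary map and that the denoising loss is simply restricted to the masked region (the paper's actual weighting scheme may differ):

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(pred_noise, target_noise, subject_mask):
    """MSE denoising loss confined to the subject region, so background
    pixels do not drive concept learning (hedged sketch)."""
    # Resize the binary subject mask to the latent/noise resolution.
    mask = F.interpolate(subject_mask, size=pred_noise.shape[-2:], mode="nearest")
    per_pixel = (pred_noise - target_noise) ** 2
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)
```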
To ensure the consistency of subject-related information, Residual Reference Attention (RRA) is incorporated into the spatial attention computation of the denoising model, effectively preserving structural details.
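The abstract does not spell out RRA's exact formulation; the following is a hedged sketch of one plausible realization, in which reference-image tokens are appended to the keys and values of the spatial self-attention and merged through a learned residual (the class and parameter names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ResidualReferenceAttention(nn.Module):
    """Hedged sketch: spatial self-attention augmented with keys/values from
    reference (concept-image) features, merged as a learned residual."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scale = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, C) spatial tokens of the denoising network
        # ref: (B, M, C) tokens extracted from the reference subject image
        self_out, _ = self.attn(x, x, x)                     # standard spatial attention
        kv = torch.cat([x, ref], dim=1)                      # append reference tokens
        ref_out, _ = self.attn(x, kv, kv)                    # reference-aware attention
        return self_out + self.scale * (ref_out - self_out)  # residual injection
```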
Experimental results demonstrate that Equilibrated Diffusion surpasses competing methods in subject consistency while closely adhering to text descriptions, validating the superiority of our approach.
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Generative Multimedia, [Experience] Multimedia Applications
Relevance To Conference: This paper proposes a novel text-guided image manipulation method, which falls within the domain of multimodal content generation and editing. Specifically, the method learns specific image concepts and generates the learned concepts under new text guidance. Consequently, this work concerns the transformation between image and text modalities, thereby advancing research in multimodal generation.
Supplementary Material: zip
Submission Number: 1677