Towards Subject-Consistent and Text-Aligned Personalized Image Generation
via Precise Attribute Learning

๐Ÿ“ Abstract

Recent advances in personalized image generation with Diffusion Transformers (DiTs) have shown remarkable progress. However, existing approaches face a trade-off between textual alignment and faithfulness to the reference subject. This trade-off arises because directly injecting subject tokens can disrupt the sampling trajectory of the base model, while textual-inversion-based methods struggle to capture the subject's detailed attributes. To address these limitations, we introduce Genova, a DiT-based subject-driven generation framework with a novel attribute learning module. This module integrates subject image tokens to refine the text-stream modulation, yielding a more distinct representation of the subject's visual attributes. Unlike conventional modulation in DiTs, our framework leverages hierarchical features from the subject image tokens, enabling more effective attribute learning. This supports precise semantic understanding of the subject, preserving the base model's inherent textual alignment and enabling more flexible and controllable image generation. Moreover, we develop CoupleX, a synthetic dataset of subject-paired samples depicting activities and interactions in natural scenes, providing richer context than previous datasets. Extensive experiments demonstrate that our method outperforms current state-of-the-art methods and achieves subject- and prompt-consistent personalized image generation.

๐Ÿ” Analysis


Comparison of our method, Genova, with two existing families of subject-driven generation methods.

(a) Methods of the first type achieve subject-driven generation via token injection. They depend heavily on the subject image and struggle with text alignment.

(b) Methods of the second type achieve subject-driven generation via specialized text embeddings. They struggle to maintain subject consistency.

(c) In contrast, our method achieves both subject consistency and text alignment through hierarchical attribute learning for enhanced modulation.

🧪 Method


(a) Overview of our proposed Genova framework. The text tokens, noisy image tokens, and subject image tokens are input into the DiT model. Each DiT block includes the proposed attribute learning module, followed by the modulation mechanism. The resulting modulation offsets \(\Delta_{attribute}\), which encode the specific subject attributes, are applied to strengthen the semantic control of the text in the MM-attention.
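To make the modulation step concrete, the following is a minimal PyTorch sketch (not the paper's code) of AdaLN-style modulation in a DiT block whose shift/scale parameters are augmented by an attribute-derived offset \(\Delta_{attribute}\). All module and variable names, the mean-pooling of attribute tokens, and the linear projections are illustrative assumptions; the MM-attention that would consume the modulated features is omitted.

```python
import torch
import torch.nn as nn

class ModulatedDiTBlockSketch(nn.Module):
    """Illustrative sketch (an assumption, not the paper's implementation):
    AdaLN-style modulation whose shift/scale receive an offset derived
    from attribute tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Maps the conditioning vector to (shift, scale).
        self.to_mod = nn.Linear(dim, 2 * dim)
        # Maps pooled attribute tokens to an offset on (shift, scale),
        # i.e. a stand-in for Delta_attribute.
        self.to_delta = nn.Linear(dim, 2 * dim)

    def forward(self, x, cond, attr_tokens):
        # Baseline modulation parameters from the conditioning vector.
        shift, scale = self.to_mod(cond).chunk(2, dim=-1)
        # Attribute offset, pooled over the attribute tokens (assumed pooling).
        d_shift, d_scale = self.to_delta(attr_tokens.mean(dim=1)).chunk(2, dim=-1)
        shift = shift + d_shift
        scale = scale + d_scale
        # Modulated features; these would then enter the MM-attention.
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 16, 64)     # noisy image tokens (batch, seq, dim)
cond = torch.randn(2, 64)      # pooled text/timestep conditioning
attr = torch.randn(2, 4, 64)   # attribute tokens
out = ModulatedDiTBlockSketch(64)(x, cond, attr)
```

The offset is additive on the existing shift/scale, so with zero-initialized `to_delta` such a design would reduce to the base model's modulation, which is one plausible way to preserve the pretrained sampling trajectory.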

(b) Details of the attribute learning module. This module processes subject image tokens (from block i-1) together with attribute tokens through subject-driven self-attention, enhancing the semantic understanding of the attribute tokens by incorporating hierarchical texture features from the subject image.
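The joint self-attention described above can be sketched as follows. This is a simplified PyTorch illustration under stated assumptions: subject image tokens from the previous block and attribute tokens are concatenated into one sequence, attended jointly, and split back apart; the class name, single-layer design, and residual/norm placement are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttributeLearningSketch(nn.Module):
    """Illustrative sketch of subject-driven self-attention: attribute
    tokens attend jointly with subject image tokens from block i-1."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, subject_tokens, attr_tokens):
        # Joint sequence: subject image tokens (from block i-1) followed
        # by the attribute tokens.
        seq = torch.cat([subject_tokens, attr_tokens], dim=1)
        attn_out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + attn_out)
        n = subject_tokens.shape[1]
        # Split back: subject tokens carry hierarchical features to block i;
        # the updated attribute tokens drive the modulation offset.
        return seq[:, :n], seq[:, n:]

subj = torch.randn(2, 8, 64)   # subject image tokens from block i-1
attr = torch.randn(2, 4, 64)   # attribute tokens
subj_out, attr_out = AttributeLearningSketch(64)(subj, attr)
```

Because both token sets share one attention map, each attribute token can aggregate texture features from every subject image token, which matches the hierarchical attribute learning the caption describes.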