Towards Subject-Consistent and Text-Aligned Personalized Image Generation via Precise Attribute Learning

10 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: personalized image generation; hierarchical feature extraction; text modulation
TL;DR: Addressing the trade-off between textual alignment and subject consistency in personalized image generation
Abstract: Personalized image generation with Diffusion Transformers (DiTs) has recently made remarkable progress. However, existing approaches face a trade-off between textual alignment and fidelity to the reference subject. This issue primarily stems from the fact that directly injecting subject tokens can disrupt the sampling trajectory of the base model, while textual-inversion methods struggle to capture the subject's detailed attributes. To address these limitations, we introduce Genova, a DiT-based subject-driven generation framework with an innovative attribute learning module. This module integrates subject image tokens into the text-stream modulation, yielding a more distinct representation of the subject's visual attributes. In contrast to conventional modulation techniques in DiTs, our framework leverages hierarchical features extracted from the subject image tokens, enabling more effective attribute learning. This allows precise semantic understanding of the subject, preserving the base model's inherent textual alignment while enabling more flexible and controllable image generation. Moreover, we construct CoupleX, a synthetic dataset of subject-paired samples depicting activities and interactions in natural scenes, providing richer context than previous datasets. Extensive experiments demonstrate that our method outperforms current state-of-the-art approaches and achieves subject-consistent, prompt-aligned personalized image generation.
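To make the abstract's core idea concrete, below is a minimal PyTorch sketch of what "integrating hierarchical subject image tokens into text-stream modulation" could look like: subject tokens from several encoder levels are pooled into attribute vectors that produce AdaLN-style scale/shift parameters for the text stream of a DiT block. All names (AttributeModulation, etc.) and design details are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the attribute-learning idea described in the abstract.
# This is an assumption-laden illustration, NOT the authors' code.
import torch
import torch.nn as nn


class AttributeModulation(nn.Module):
    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        # One learned pooling query per hierarchy level of subject tokens.
        self.queries = nn.Parameter(torch.randn(num_levels, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Map the concatenated per-level attributes to per-channel scale/shift.
        self.to_scale_shift = nn.Linear(dim * num_levels, 2 * dim)

    def forward(self, text_tokens, subject_levels):
        # text_tokens:    (B, T, dim) text stream of a DiT block
        # subject_levels: list of (B, N_l, dim) subject-token maps per level
        B, dim = text_tokens.shape[0], text_tokens.shape[-1]
        pooled = []
        for lvl, tokens in enumerate(subject_levels):
            q = self.queries[lvl].expand(B, 1, dim)   # (B, 1, dim) query
            attr, _ = self.attn(q, tokens, tokens)    # pool level lvl
            pooled.append(attr.squeeze(1))            # (B, dim)
        attrs = torch.cat(pooled, dim=-1)             # (B, num_levels * dim)
        scale, shift = self.to_scale_shift(attrs).chunk(2, dim=-1)
        # Modulate the text stream with subject-derived attributes.
        return text_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Usage with dummy tensors:
mod = AttributeModulation(dim=768, num_levels=3)
text = torch.randn(2, 77, 768)
levels = [torch.randn(2, n, 768) for n in (256, 64, 16)]
out = mod(text, levels)  # (2, 77, 768)
```

The design choice sketched here matches the abstract's motivation: because the subject signal enters through modulation of the text stream rather than as extra tokens in the sampling path, the base model's denoising trajectory is left intact, which is what the trade-off argument hinges on.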
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3559