Keywords: consistent generation
Abstract: Recent advancements in text-to-image (T2I) generation have significantly improved image quality and text alignment. However, generating multiple coherent images that maintain consistent character identities across diverse textual descriptions remains challenging. Existing methods trade off identity consistency against per-image text fidelity, often producing uniform poses or missing prompt-specific details, and thus perform unevenly across images.
In this paper, we analyze the text embeddings of word and padding (PAD) tokens in the scene descriptions, together with the ambiguity of the identity description. We identify the identity-related and irrelevant components of these embeddings and amplify and suppress them, respectively. Additionally, we detect under-specified identity descriptions and reuse their features during the generation process. Finally, we introduce a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric. CQS explicitly captures performance imbalances, aligning evaluation closely with human perceptual preferences.
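To make the two ideas above concrete, here is a minimal, purely illustrative Python sketch. The abstract does not specify the mechanism, so everything below is an assumption: the identity-token mask, the function names (`reweight_text_embeddings`, `consistency_quality_score`), and the scale factors are hypothetical, not the authors' method.

```python
import torch

def reweight_text_embeddings(
    embeds: torch.Tensor,        # [num_tokens, dim] prompt embeddings (word + PAD tokens)
    identity_mask: torch.Tensor,  # [num_tokens] bool; True for identity-related tokens (assumed given)
    amp: float = 1.5,             # hypothetical amplification factor for identity components
    sup: float = 0.5,             # hypothetical suppression factor for irrelevant components
) -> torch.Tensor:
    """Amplify identity-related token embeddings and suppress the rest (sketch only)."""
    scale = torch.full((embeds.shape[0],), sup, dtype=embeds.dtype, device=embeds.device)
    scale[identity_mask] = amp
    return embeds * scale.unsqueeze(-1)
```

The CQS formula is likewise not given in the abstract; one common way to combine two sub-scores while penalizing imbalance, consistent with the stated goal of capturing performance imbalances, is a harmonic mean. The sketch below assumes that form and that the weakest per-image alignment dominates; the actual definition is in the paper.

```python
def consistency_quality_score(identity_score: float, text_scores: list[float]) -> float:
    """Hypothetical CQS: harmonic mean of identity preservation and text alignment.

    The harmonic mean drops sharply when either sub-score is weak, so a method
    that excels at one axis but fails the other cannot score well (assumption).
    """
    text_score = min(text_scores)  # assumed: the worst-aligned image dominates
    if identity_score <= 0 or text_score <= 0:
        return 0.0
    return 2 * identity_score * text_score / (identity_score + text_score)
```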
Our framework achieves state-of-the-art performance, effectively resolving prior trade-offs and providing valuable insights into consistent image generation.
Primary Area: generative models
Submission Number: 58