RealignDiff: Boosting text-to-image diffusion model with coarse-to-fine semantic re-alignment

19 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Desk Rejected Submission
Keywords: Text-to-Image Diffusion, Text-Image Alignment, Semantic Re-alignment
Abstract: Recent text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from text prompts. However, previous methods have struggled to establish precise alignment between textual concepts and generated images, owing to the lack of semantic guidance that can diagnose the discrepancy between the two modalities. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, to improve text-image alignment in text-to-image diffusion models. In the coarse semantic re-alignment stage, we introduce a caption reward that operates from a global semantic perspective and is incorporated into a reward feedback learning framework to optimize the diffusion model. Specifically, the caption reward uses a BLIP-2 model to generate a detailed caption covering the crucial content of the synthetic image, and then computes the reward score as the similarity between the generated caption and the given prompt. In the fine semantic re-alignment stage, we propose a local dense caption generation module that refines the previously generated images from a local semantic perspective. This module produces a mask, a detailed caption, and a corresponding likelihood score for each object in the generated image. We further introduce an attention modulation method that guides the diffusion model to realign the generated captions with the object masks of the generated image. Experimental results on the MS-COCO benchmark demonstrate that the proposed two-stage coarse-to-fine semantic re-alignment method outperforms baseline re-alignment techniques by a substantial margin in both visual quality and semantic similarity with the input prompt.
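To make the coarse-stage reward concrete, below is a minimal sketch of the caption reward scoring described in the abstract. It assumes BLIP-2 is loaded through HuggingFace transformers and that caption-prompt similarity is computed with a sentence-transformers text encoder; the abstract does not specify the similarity model or checkpoints, so those choices are illustrative rather than the authors' exact setup.

```python
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# BLIP-2 captioner (checkpoint choice is illustrative, not from the paper).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Text encoder for caption-prompt similarity (an assumed choice; the
# abstract does not name the similarity model).
text_encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

@torch.no_grad()
def caption_reward(image: Image.Image, prompt: str) -> float:
    """Caption the synthetic image with BLIP-2, then score the caption
    against the input prompt via cosine similarity."""
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out_ids = captioner.generate(**inputs, max_new_tokens=40)
    caption = processor.decode(out_ids[0], skip_special_tokens=True).strip()
    embeddings = text_encoder.encode([caption, prompt], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()  # score in [-1, 1]
```

In the full reward feedback learning loop, this scalar would serve as the training signal for fine-tuning the diffusion model; the sketch covers only the scoring step, not the optimization.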
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1638