MirrorDiff: Prompt redescription for zero-shot grounded text-to-image generation with attention modulation
Abstract: Highlights
• We propose a zero-shot grounded text-to-image-text framework for image generation.
• We use a large language model (LLM) as a layout generator to produce the scene layout.
• We design a layout-guided attention modulation to mitigate the loss of small objects.
• We present semantic text regeneration supervision to align the regenerated text with the input text.
External IDs: doi:10.1016/j.engappai.2025.110741