Keywords: Text-to-Image Synthesis, Diffusion Model, Chain-of-Thought (CoT)
Abstract: Recent advances in Large Language Models (LLMs), Large Multi-Modal Models (LMMs), and text-to-image generation have significantly improved multimodal understanding and generation. However, a fundamental gap remains between human drawing processes and the iterative denoising mechanism of existing diffusion-based models, leading to structural inaccuracies, prompt inconsistencies, and factual errors. To address this, we propose CoTDiff, a novel diffusion-based multi-stage image synthesis framework that integrates Chain-of-Thought (CoT) reasoning. The approach introduces two forms of CoT: textual CoT, in which an LLM plans the image layout from the prompt, and diffusion CoT, which generates images in multiple stages (edge maps, grayscale images, and color images), mimicking the human drawing process.
CoTDiff leverages a feature insertion mechanism to harmonize these stages, reducing conflicts between stages and improving cross-stage consistency. Empirical results demonstrate that CoTDiff outperforms existing text-to-image methods, particularly on complex tasks requiring accurate object counting and spatial control. By bridging the gap between human drawing and machine generation, CoTDiff offers a fresh perspective on integrating CoT into image synthesis and unlocks the latent potential of diffusion models to produce high-quality, detailed, and coherent images.
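The abstract gives no implementation details; the following is a minimal Python sketch of the control flow it describes, under stated assumptions. Every name here (plan_layout, denoise_stage, cotdiff) and all behavior are hypothetical stand-ins, not the authors' code: a real system would query an LLM for the layout and run conditioned denoising loops in each stage rather than returning random pixels.

```python
import numpy as np

def plan_layout(prompt: str):
    # Textual CoT stub: a real system would prompt an LLM to return
    # object boxes for the scene. Here, one placeholder box per word.
    return [{"object": w, "box": (0.1 * i, 0.1, 0.3, 0.3)}
            for i, w in enumerate(prompt.split()[:3])]

def denoise_stage(stage: str, layout, features=None, size=64):
    # Diffusion CoT stub: one staged pass (edge, grayscale, or color).
    # `layout` is unused here; a real stage would condition on it.
    rng = np.random.default_rng(len(stage))
    channels = 3 if stage == "color" else 1
    image = rng.random((size, size, channels))
    if features is not None:
        # Feature-insertion stub: blend in earlier-stage features so
        # later stages stay consistent with earlier ones.
        image = 0.5 * image + 0.5 * features[..., None]
    return image, image.mean(axis=-1)

def cotdiff(prompt: str):
    layout = plan_layout(prompt)                  # textual CoT: plan layout
    feats = None
    for stage in ("edge", "grayscale", "color"):  # diffusion CoT: staged synthesis
        image, feats = denoise_stage(stage, layout, feats)
    return image

print(cotdiff("a red cube on a blue table").shape)  # (64, 64, 3)
```

The point of the sketch is the control flow: a layout plan feeds three successively richer generation stages, with each stage's features inserted into the next to harmonize them.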
Primary Area: generative models
Submission Number: 8431