Abstract: Recently, advancements in text-to-image synthesis and image customization have drawn significant attention. Among these technologies, foreground-driven image synthesis models aim to create diverse scenes for specific foregrounds, showing broad application prospects. However, existing foreground-driven diffusion models struggle to accurately generate scenes with layouts that align with user intentions. To address these challenges, we propose CompCraft, a training-free framework that enhances layout control and improves overall generation quality in current models. First, we identify that existing methods fail to achieve effective control because fully denoised foreground information exerts excessive influence on the generated scene. To address this, we propose a foreground regularization strategy that modifies the foreground-related attention maps, reducing their impact and ensuring better integration of the foreground with the generated scene. Then, we propose a series of inference-time layout guidance strategies that steer the image generation process according to the user's finely customized layouts. These strategies equip current foreground-driven diffusion models with accurate layout control. Finally, we introduce a comprehensive benchmark to evaluate CompCraft. Both quantitative and qualitative results demonstrate that CompCraft can effectively generate high-quality images with precise customized layouts, showcasing its strong capabilities in practical image synthesis applications.
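The abstract does not specify how the foreground-related attention maps are modified, so the following is only a minimal illustrative sketch of one plausible mechanism: multiplicatively down-weighting attention toward foreground-derived key tokens before the softmax. The function name `regularize_foreground_attention` and the parameters `fg_mask` and `scale` are assumptions, not the paper's actual API.

```python
import torch
import torch.nn.functional as F

def regularize_foreground_attention(attn_logits: torch.Tensor,
                                    fg_mask: torch.Tensor,
                                    scale: float = 0.5) -> torch.Tensor:
    """Illustrative sketch (not the paper's method): down-weight attention
    toward foreground key tokens to reduce their influence on the scene.

    attn_logits: (batch, heads, queries, keys) pre-softmax attention scores.
    fg_mask:     (keys,) boolean mask marking foreground-derived key tokens.
    scale:       factor < 1 that shrinks foreground attention weights.
    """
    logits = attn_logits.clone()
    # Shifting foreground-key logits by log(scale) multiplies their
    # post-softmax attention weights by `scale`, up to renormalization.
    logits[..., fg_mask] += torch.log(torch.tensor(scale))
    return F.softmax(logits, dim=-1)

# Toy usage: 4 heads, 16 query tokens, 16 key tokens, the last 4 keys
# assumed to come from the fully denoised foreground.
logits = torch.randn(1, 4, 16, 16)
fg_mask = torch.zeros(16, dtype=torch.bool)
fg_mask[-4:] = True
attn = regularize_foreground_attention(logits, fg_mask, scale=0.5)
```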
External IDs: dblp:journals/tcsv/GuoCNWL25