Abstract: Diffusion models have achieved remarkable progress in photorealistic synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, especially in complex, high-density settings. We introduce CountLoop, a training-free framework that equips diffusion models with accurate instance control via iterative structured feedback. It alternates between image synthesis and multimodal agent evaluation: an LLM-guided layout planner and critic provide explicit feedback on object counts, spatial arrangements, and attribute consistency, which is used to refine scene layouts and guide subsequent generations. Instance-driven attention masking and compositional techniques further prevent semantic leakage, enabling clear separation of individual objects even in occluded scenes. Evaluations on COCO-Count, T2I-CompBench, and two newly introduced high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% and an overall score of 0.97, maintaining spatial fidelity and visual quality while outperforming existing layout-based and gradient-guided baselines.
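The abstract describes an iterative generate-evaluate-refine loop. The following is a minimal sketch of that control flow under stated assumptions: all names here (Feedback, plan_layout, generate, critique, refine) are hypothetical placeholders standing in for the LLM layout planner, the diffusion generator with instance-driven attention masks, and the multimodal critic; they are not the paper's actual interfaces.

```python
# Hypothetical sketch of the CountLoop-style feedback loop; the planner,
# generator, critic, and refiner are injected as callables because the
# paper's concrete APIs are not specified here.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Feedback:
    count_ok: bool      # does the detected instance count match the target?
    layout_ok: bool     # are spatial arrangement and attributes consistent?
    notes: str = ""     # structured critique used to refine the layout

def countloop(
    prompt: str,
    target_count: int,
    plan_layout: Callable[[str, int], Any],         # LLM-guided layout planner
    generate: Callable[[str, Any], Any],            # diffusion synthesis with instance masks
    critique: Callable[[Any, str, int], Feedback],  # multimodal agent evaluation
    refine: Callable[[Any, Feedback], Any],         # layout refinement from feedback
    max_iters: int = 5,
) -> Any:
    layout = plan_layout(prompt, target_count)
    image = None
    for _ in range(max_iters):
        image = generate(prompt, layout)
        fb = critique(image, prompt, target_count)
        if fb.count_ok and fb.layout_ok:
            break                        # count and layout verified; stop early
        layout = refine(layout, fb)      # otherwise iterate with a refined layout
    return image
```

The loop terminates either when the critic accepts the count and arrangement or when the iteration budget is exhausted, matching the training-free, feedback-driven refinement the abstract describes.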