Abstract: Diffusion models excel at photorealistic synthesis but struggle with precise object counts, especially in high-density settings. We introduce COUNTLOOP, a training-free framework that achieves precise instance control through iterative, structured feedback. Our method alternates between synthesis and evaluation: a VLM-based planner generates structured scene layouts, while a VLM-based critic provides explicit feedback on object counts, spatial arrangements, and visual quality to refine the layout iteratively. Instance-driven attention masking and cumulative attention composition further prevent semantic leakage, ensuring clear object separation even in densely occluded scenes. Evaluations on COCO-Count, T2I-CompBench, and two newly introduced high-instance benchmarks show that COUNTLOOP reduces counting error by up to 57% and achieves the highest or comparable spatial quality scores across all benchmarks, while maintaining photorealism.
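The alternation between synthesis and evaluation described in the abstract can be sketched as a simple control loop. This is a minimal illustration only, not the COUNTLOOP implementation: the names `refine_loop`, `Feedback`, and the `toy_*` functions are hypothetical, the planner, generator, and critic are replaced by deterministic stand-ins, and the critic here reports only a count (the abstract also mentions spatial arrangement and visual quality, which are omitted).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layout representation: one (x, y, w, h) box per instance.
Box = Tuple[float, float, float, float]
Layout = List[Box]


@dataclass
class Feedback:
    counted: int  # instances the critic reports seeing
    target: int   # instances requested by the prompt

    @property
    def error(self) -> int:
        return abs(self.counted - self.target)


def refine_loop(target, plan, synthesize, critique, revise, max_iters=5):
    """Plan a layout, synthesize, critique, and revise until the count matches."""
    layout = plan(target)
    image = synthesize(layout)
    feedback = critique(image, target)
    for _ in range(max_iters):
        if feedback.error == 0:
            break
        layout = revise(layout, feedback)
        image = synthesize(layout)
        feedback = critique(image, target)
    return layout, feedback


# Toy stand-ins (assumptions for illustration, not real VLM or diffusion calls).
def toy_plan(target: int) -> Layout:
    # A deliberately flawed planner that starts two boxes short.
    return [(0.1 * i, 0.5, 0.08, 0.08) for i in range(max(target - 2, 0))]


def toy_synthesize(layout: Layout) -> Layout:
    # The "image" is just the layout itself, so the critic can count it.
    return list(layout)


def toy_critique(image: Layout, target: int) -> Feedback:
    return Feedback(counted=len(image), target=target)


def toy_revise(layout: Layout, fb: Feedback) -> Layout:
    # Add one box when under the target, drop one when over.
    if fb.counted < fb.target:
        i = len(layout)
        return layout + [(0.1 * i, 0.5, 0.08, 0.08)]
    return layout[:-1]


final_layout, final_feedback = refine_loop(
    6, toy_plan, toy_synthesize, toy_critique, toy_revise
)
print(len(final_layout), final_feedback.error)
```

In this toy setting the loop stops as soon as the critic's count matches the target; `max_iters` bounds the number of revisions when it never does.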
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=B7lXtvKImq
Changes Since Last Submission: The previous submission was desk-rejected for template non-compliance ("Modified template, please revisit and resubmit"). We have addressed all violations:
tmlr.sty: Restored one missing line (\lhead{Under review as submission to TMLR}) to match the official template exactly.
preamble.tex: Removed all layout-affecting overrides: global \captionsetup{skip=2pt}, \captionsetup[figure], \captionsetup[table], \usepackage[font=small,labelfont=bf]{caption}, \setlength{\dbltextfloatsep}, and \setlength{\dblfloatsep}.
tmlr.tex: Removed \large from the title command.
Negative \vspace: Removed all active negative vertical spacing from both the main paper and supplementary.
Supplementary: Added a Broader Impact section.
No changes were made to experimental results, methodology, or claims.
Assigned Action Editor: ~Ning_Yu2
Submission Number: 9259