D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Nobline Yoo; Olga Russakovsky; Ye Zhu

01 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: text-to-image generation, numeracy enhancement, counting critic

TL;DR: We address the problem of generating the correct number of objects by proposing a new way to convert robust detectors into differentiable critics, yielding the highest numeracy across low-density, high-density, single-object, and multi-object settings.

Abstract: Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle to generate the number of objects specified in a prompt. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide generation toward the correct count. Specifically, we design custom activation functions to convert detector logits into binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy, by up to 13.7%, with minimal degradation in overall image quality and minimal added computational overhead.
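The idea of turning detector logits into near-binary indicators that still carry gradients can be illustrated with a small sketch. This is not the paper's activation function, which the abstract does not specify; it assumes a generic steep sigmoid as a stand-in, and every name here (`soft_indicator`, `soft_count`, `count_loss_and_grad`, `optimize_to_target`, `sharpness`, `threshold`) is hypothetical. It also optimizes the logits directly, whereas the method described above optimizes the noise prior of a T2I model with gradients flowing back through the detector.

```python
import math

def soft_indicator(logit, threshold=0.0, sharpness=10.0):
    """Steep sigmoid (assumed stand-in): near 1 above threshold, near 0 below."""
    z = sharpness * (logit - threshold)
    if z >= 0:  # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_count(logits, threshold=0.0, sharpness=10.0):
    """Differentiable surrogate for the count: sum of soft indicators."""
    return sum(soft_indicator(l, threshold, sharpness) for l in logits)

def hard_count(logits, threshold=0.0):
    """Count by enumeration, as a detector would; not differentiable."""
    return sum(1 for l in logits if l > threshold)

def count_loss_and_grad(logits, target, threshold=0.0, sharpness=10.0):
    """Squared count error and its analytic gradient w.r.t. each logit."""
    inds = [soft_indicator(l, threshold, sharpness) for l in logits]
    diff = sum(inds) - target
    # d/dl sigmoid(k * (l - t)) = k * s * (1 - s)
    grad = [2.0 * diff * sharpness * s * (1.0 - s) for s in inds]
    return diff ** 2, grad

def optimize_to_target(logits, target, lr=0.05, steps=200):
    """Plain gradient descent on the logits (stand-in for the noise prior)."""
    logits = list(logits)
    for _ in range(steps):
        _, grad = count_loss_and_grad(logits, target)
        logits = [l - lr * g for l, g in zip(logits, grad)]
    return logits

# Five hypothetical detection logits; three exceed the threshold.
example_logits = [2.0, 1.5, 0.3, -0.2, -3.0]
adjusted = optimize_to_target(example_logits, target=2)
print(hard_count(example_logits), hard_count(adjusted), soft_count(adjusted))
```

In this toy setting the gradient acts mostly on borderline detections, since confident logits sit in the flat region of the sigmoid. Whether the paper's custom activations behave the same way is not stated in the abstract.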

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 440
