Dual-Granularity Point Prediction for Text-Prompted Object Counting

Yuxuan Yuan

Published: 21 Apr 2026, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Density map regression has long been the de facto standard for visual object counting. However, it often struggles with cross-domain generalization and fails to provide precise instance-level localization. Furthermore, counting datasets remain heavily siloed across distinct fields such as pathology, agriculture, aerial imagery, and crowd analysis, which makes it difficult to train universal models. In this work, we propose a text-guided generalist object counting framework that abandons density maps in favor of an instance-grounded set of target points. To bridge the gap between varying object scales and densities across domains, our architecture employs a category-conditioned counting strategy. A sparse, region-level counter anchors large or isolated targets, while a dense, pixel-level counter resolves small, heavily clustered, or weakly bounded instances. We also introduce a unified, large-scale, cross-domain object counting benchmark that harmonizes over 15 million point annotations from heterogeneous visual domains. Trained on this unified corpus with a pixel-level dense counter loss formulation, our model learns to fuse sparse and dense predictions through complementary count fusion, conditioned on free-form language queries. Comprehensive evaluations confirm that this point-based approach significantly outperforms existing open-world counting methods in both multi-domain accuracy and spatial interpretability.
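
The abstract does not specify how complementary count fusion is implemented. As a rough illustration of the general idea of merging two point sets of different granularity, the sketch below keeps every point from a sparse predictor and adds only those dense points that do not lie near a sparse one, then reports the count as the size of the fused set. The function name `fuse_points`, the `radius` threshold, and the distance-based suppression rule are all assumptions made for this example, not the paper's method.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def fuse_points(sparse: List[Point], dense: List[Point], radius: float = 8.0) -> List[Point]:
    """Hypothetical fusion rule (an assumption, not the paper's formulation).

    Keeps all sparse points, then appends each dense point that is farther
    than `radius` from every sparse point, so the dense predictor only
    contributes instances the sparse predictor did not already cover.
    """
    fused = list(sparse)
    for d in dense:
        if all(math.dist(d, s) > radius for s in sparse):
            fused.append(d)
    return fused


# Toy inputs: two large, isolated objects found by the sparse predictor,
# plus dense predictions, one of which duplicates a sparse point.
sparse_points = [(10.0, 10.0), (100.0, 50.0)]
dense_points = [(11.0, 9.0), (200.0, 200.0), (204.0, 203.0)]

fused = fuse_points(sparse_points, dense_points, radius=8.0)
count = len(fused)  # the predicted count is the number of fused points
```

In this toy case the dense point at (11, 9) is treated as a duplicate of the sparse point at (10, 10) and dropped, while the two clustered dense points are kept. Because the output is a set of points rather than a density map, each counted instance has an explicit location.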