Grounded Visual Counting with Open-Vocabulary Queries: A Unified Approach for Heterogeneous Datasets

Yuxuan Yuan

Published: 31 Dec 2025, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Generalizing object counting to arbitrary categories and unseen domains is a major challenge, as real-world targets range from microscopic cells and crops to urban crowds and vehicles. Traditional methods rely on domain-specific architectures and closed-set labels, which severely limits their applicability. We present a generalist, language-driven counting model that enumerates any target specified by a text query and outputs instance-level spatial coordinates rather than a single global count. To support model training and cross-domain evaluation, we compile a large-scale dataset that aggregates disparate counting benchmarks into a single, standardized format covering natural, biological, and remote-sensing imagery. At the core of our method is a hybrid point-prediction mechanism that handles the extreme variance in object density and scale. It couples a sparse object-level counter for distinct instances with a dense pixel-level counter for heavily clustered regions. The two representations are supervised by a shared point-centric objective and merged through a parameter-free fusion module. Results across six distinct visual domains show that our framework delivers strong open-vocabulary counting performance and more reliable spatial grounding than traditional map-based regression techniques.
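To make the idea of merging two point-based predictions concrete, the following is a minimal sketch and not the authors' fusion module, whose details the abstract does not give. It assumes that "parameter-free" means no learned weights, that both counters emit (x, y) points, and that a dense point is a duplicate if it lies within a fixed radius of a sparse point. The function name `fuse_points`, the `radius` value, and the sample coordinates are all hypothetical.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


def fuse_points(sparse: List[Point], dense: List[Point], radius: float = 4.0) -> List[Point]:
    """Merge two point sets without learned weights (illustrative assumption).

    All sparse (object-level) points are kept. A dense (pixel-level) point is
    added only if it is farther than `radius` from every sparse point, so an
    instance found by both counters is counted once.
    """
    fused = list(sparse)
    for dx, dy in dense:
        is_duplicate = any(hypot(dx - sx, dy - sy) <= radius for sx, sy in sparse)
        if not is_duplicate:
            fused.append((dx, dy))
    return fused


# Hypothetical predictions: one dense point overlaps a sparse detection.
sparse_points = [(10.0, 10.0), (50.0, 40.0)]
dense_points = [(11.0, 9.0), (80.0, 80.0), (82.0, 86.0)]

fused = fuse_points(sparse_points, dense_points)
count = len(fused)  # the predicted count is the number of fused points
```

In this sketch the count is read directly from the fused point set, which is what separates point-based prediction from regressing a density map and integrating it. How the actual module resolves overlaps between the two counters may differ.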