Universal Point-Based Object Enumeration with Language-Conditioned Visual Queries

Zhiqi Shao

Published: 04 Aug 2025, Last Modified: 05 May 2026, OpenReview Archive Direct Upload, Everyone, CC BY-NC 4.0

Abstract: Object counting is usually studied through narrow, domain-specific datasets, with separate models and particular scenarios for crowds, vehicles, biological structures, remote-sensing objects, and aerial imagery. This fragmentation makes it difficult to build counting systems that can follow a user’s textual description and work reliably across visual domains. We formulate counting as text-guided point enumeration: given an image and a natural-language query, the model identifies instance-grounded target points, and the count is obtained directly as the number of predicted points. To study this problem at scale, we assemble a unified cross-domain counting corpus by converting diverse public datasets into a common image-query-point format, spanning natural scenes, remote sensing, microscopy, pathology, microorganisms, and crop imagery. We then introduce a generalist counting architecture that combines coarse region-level instance anchoring with fine pixel-level point prediction. The region-level sparse counter handles large and well-separated objects, while the complementary count fusion is designed for small, dense, and ambiguous targets. A point-based training scheme allows the model to learn from mixed annotation types, and a simple fusion CLOC rule integrates the two prediction streams without additional learned parameters. Experiments on large-scale cross-domain object counting show robust category-conditioned counting and improved generalization over recent open-vocabulary and open-world counting baselines.
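
The formulation above can be made concrete with a small sketch. It shows a possible image-query-point record, the count read off as the number of points, and one parameter-free way to merge two point streams. All names here (`CountingSample`, `count_from_points`, `fuse_points`, `radius`) are hypothetical, and the merge rule is an assumption chosen for illustration; the abstract does not specify the actual fusion rule.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class CountingSample:
    """Hypothetical record for a common image-query-point format."""
    image_path: str
    query: str            # natural-language description of the target
    points: List[Point]   # one annotated point per target instance


def count_from_points(points: List[Point]) -> int:
    """The count is simply the number of points."""
    return len(points)


def fuse_points(sparse: List[Point], dense: List[Point],
                radius: float = 4.0) -> List[Point]:
    """Illustrative parameter-free merge (an assumption, not the paper's rule).

    Keeps every sparse-stream point, then adds each dense-stream point
    that lies farther than `radius` pixels from all sparse points.
    Dense points are not deduplicated against each other.
    """
    fused = list(sparse)
    for x, y in dense:
        if all(hypot(x - sx, y - sy) > radius for sx, sy in sparse):
            fused.append((x, y))
    return fused


sample = CountingSample("field.png", "wheat heads", [(10.0, 10.0), (50.0, 50.0)])
dense_stream = [(11.0, 10.0), (80.0, 80.0), (82.0, 80.0)]
fused = fuse_points(sample.points, dense_stream)
print(count_from_points(fused))
```

In this sketch the dense point at (11, 10) is treated as a duplicate of the sparse point at (10, 10), while the two dense points near (80, 80) are both kept, so nearby small objects are not collapsed into one.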