Language-Driven Cross-Domain Counting through Sparse and Dense Instance Points

Zhiqi Shao

Published: 30 Nov 2025, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Despite the success of foundation vision models, object counting remains highly specialized: methods trained for one setting often fail when transferred to another domain with different object appearance, density, or scale. We address this limitation by revisiting counting as a text-guided localization problem. Rather than predicting a density map or a single scalar, the proposed model receives an image together with a free-form category description and outputs instance-centered points corresponding to the queried objects. This representation provides both an interpretable count and spatial evidence for the prediction. We build a large unified benchmark for this task by harmonizing counting annotations from several domains, including everyday images, aerial scenes, medical and biological imagery, microorganisms, and agriculture. On top of this benchmark, we develop a dual-resolution counting network. A sparse instance branch proposes region-level anchors for visible, separated targets, while a dense point branch resolves crowded or weakly delineated objects at the pixel level. The two branches are trained with a shared point-centric objective that accommodates heterogeneous labels and are combined by a complementary fusion procedure. Comprehensive evaluation demonstrates that the model generalizes well across categories and domains, outperforming existing open-world counting approaches while retaining localized instance-level predictions.
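The abstract does not specify how the complementary fusion procedure works. As a purely illustrative sketch, one simple way to combine two point sets is to keep every point from the sparse branch and add only those dense-branch points that do not fall near an already accepted point. The function name `fuse_points`, the `radius` parameter, and the greedy distance rule below are all assumptions made for illustration, not the method described in the paper.

```python
from math import hypot

def fuse_points(sparse_points, dense_points, radius=8.0):
    """Hypothetical fusion rule (an assumption, not the paper's procedure).

    Keeps all sparse-branch points, then greedily adds each dense-branch
    point only if it lies farther than `radius` pixels from every point
    accepted so far. Points are (x, y) tuples in pixel coordinates.
    """
    fused = list(sparse_points)
    for dx, dy in dense_points:
        # Accept the dense point only if no accepted point is within radius.
        if all(hypot(dx - fx, dy - fy) > radius for fx, fy in fused):
            fused.append((dx, dy))
    return fused

# Toy inputs: two well-separated instances from the sparse branch, and
# three dense-branch detections, one of which duplicates a sparse point
# and two of which are near-duplicates of each other.
sparse = [(10.0, 10.0), (50.0, 50.0)]
dense = [(11.0, 9.0), (100.0, 100.0), (101.0, 100.0)]

fused = fuse_points(sparse, dense)
count = len(fused)  # the predicted count is the number of fused points
```

Under this assumed rule, the count is simply the number of fused points, and each point doubles as spatial evidence for the prediction, which matches the point-based output representation the abstract describes.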