SAM3Count for Zero-Shot Open Vocabulary Counting in Images and Videos

Published: 01 Jun 2026, Last Modified: 01 Jun 2026 | CVPR 2026 Workshop WiCV Proceedings Track Oral | Everyone | CC BY 4.0
Keywords: Open-Vocabulary Counting, Segment Anything Model 3, Zero-Shot Detection and Counting, Object Tracking
TL;DR: An Image and Video model built on SAM3 for Zero-Shot Object Counting and Tracking
Abstract: Open-vocabulary counting (OVC) has garnered significant attention for its ability to count objects without relying on manual exemplars or class-specific requirements. OVC requires prompt localization, scene understanding, and composition, as well as the ability to distinguish between instances of the same object type. Most methods rely on manually provided exemplars alongside text prompts to aid localization and improve performance, but this comes at the cost of ease of use. In the video domain, OVC is equally important for real-time applications, where occlusions, deformations, and fragmentation pose additional challenges. We introduce SAM3Count, a zero-shot SAM3-based OVC framework for images and videos. For video counting, SAM3Count extends SAM3 with a lightweight re-identification tracker that maintains an appearance bank to recover lost tracks and curb identity switches. For images, it uses adaptive ROI tiling to improve counting performance across diverse scenes without requiring manual exemplars or priors. SAM3Count surpasses the most recent state-of-the-art (SOTA) methods across image (FSCD-147, ShanghaiTech, CARPK) and video (TAO-Count, Penguins) benchmarks. Code is available at https://github.com/Joan947/SAM3Count.
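Illustrative sketch (not taken from the SAM3Count code): the abstract describes an appearance bank that recovers lost tracks and curbs identity switches. One way such a bank could work is shown below, assuming each detection comes with an embedding vector and that cosine similarity against stored embeddings decides whether a lost identity is reused. The class name `AppearanceBank`, the parameters `max_per_track` and `match_threshold`, and the threshold value are all hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class AppearanceBank:
    """Hypothetical per-track store of recent appearance embeddings."""

    def __init__(self, max_per_track=5, match_threshold=0.8):
        self.max_per_track = max_per_track      # assumed bank size per track
        self.match_threshold = match_threshold  # assumed similarity cutoff
        self.bank = {}                          # track_id -> deque of embeddings
        self.next_id = 0

    def _add(self, track_id, emb):
        store = self.bank.setdefault(track_id, deque(maxlen=self.max_per_track))
        store.append(emb)

    def assign(self, emb, active_ids=()):
        """Return a track id for a detection the base tracker left unmatched.

        Reuses the id of the most similar non-active track if its similarity
        reaches the threshold; otherwise opens a new identity.
        """
        best_id, best_sim = None, self.match_threshold
        for tid, embs in self.bank.items():
            if tid in active_ids:
                continue  # a track already visible in this frame cannot be reused
            sim = max(cosine(emb, e) for e in embs)
            if sim >= best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:
            best_id = self.next_id
            self.next_id += 1
        self._add(best_id, emb)
        return best_id

    def count(self):
        """Number of distinct identities seen so far."""
        return len(self.bank)


bank = AppearanceBank()
first = bank.assign([1.0, 0.0])         # new object
second = bank.assign([0.0, 1.0])        # dissimilar, so a second object
recovered = bank.assign([0.99, 0.05])   # resembles the first, so its id is reused
```

In this sketch a reappearing object keeps its original id, so the running count stays at two rather than growing to three.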
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 27