Keywords: Few-shot Object Counting, Training-free, SAM, Output Token, Probability Distribution
TL;DR: This paper proposes a novel training-free framework for object counting that uses SAM without any modifications. The key innovation is directly analyzing SAM's internal 'output tokens' to accurately identify and count objects.
Abstract: Object counting is a critical computer vision task with widespread applications in manufacturing, traffic monitoring, and crowd analysis. Recent class-agnostic object counting methods leveraging the Segment Anything Model (SAM) are limited by the inherent uncertainty of the similarity metric derived from its image encoder. While incorporating additional encoders can refine this similarity, doing so incurs high computational costs. To overcome this challenge, we propose a novel training-free framework consisting of two components that work in synergy with SAM: a probabilistic prompt generation stage and an output token-based verification stage. The prompt generation stage efficiently generates prompts based on probability distributions derived from SAM's image embedding, while the verification stage uses SAM's output tokens to effectively distinguish positive from negative instances. Experimental results show that our method achieves superior accuracy with an MAE of 16.25, outperforming existing training-based and training-free counting methods, and performing comparably to training-free approaches that require additional models alongside foundation models. Notably, on the CARPK dataset, our method outperforms all supervised methods and achieves results comparable to training-free counting methods. Ablation studies confirm that this performance gain is critically attributable to the two key components. This study not only presents an effective solution for object counting but also showcases the potential of applying foundation models to downstream tasks without fine-tuning or additional models.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10378