Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting

Published: 26 Oct 2025, Last Modified: 26 Feb 2026 · The 33rd ACM International Conference on Multimedia · Everyone · CC BY 4.0
Abstract: Video Individual Counting (VIC), which seeks to count unique individuals across video sequences without duplication, has broader applications than traditional Video Crowd Counting (VCC), including urban planning, event management, and safety monitoring. However, although current VIC approaches have demonstrated strong capabilities, their reliance on identity-level or group-level annotations necessitates substantial labeling effort and expense. To reduce the high costs of manual annotation, we introduce VIC-SSL, a novel self-supervised learning approach that utilizes unlabeled data along with the innovative feature-level augmentation technique called Foreground-driven ShiftMix (F-ShiftMix). By blending and shifting in the feature space rather than the image space, F-ShiftMix generates realistic crowd motion without explicit annotations, while preserving global semantic coherence. Furthermore, VIC-SSL integrates the Cost-guided Flow Prompt (CFP) and the Distinction-aware Cross-Attention (DCA) to enhance flow-aware localization and inter-frame correspondence learning. Our extensive experiments across three datasets, including SenseCrowd, CroHD, and CARLA, demonstrate that VIC-SSL substantially outperforms existing methods, achieving state-of-the-art results with significantly reduced data requirements. These results showcase VIC-SSL’s potential to dramatically lower annotation costs and improve the deployment feasibility of VIC systems in complex scenarios. The project website is available at https://leohuang0511.github.io/vic-ssl.
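
The abstract describes F-ShiftMix only at a high level, as shifting and blending foreground content in feature space. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. The function name `foreground_shift_mix`, the soft foreground mask, the integer shift `(dy, dx)`, and the blend weight `alpha` are all assumptions introduced here for illustration.

```python
import numpy as np

def foreground_shift_mix(feat, fg_mask, dy, dx, alpha=0.5):
    """Hypothetical sketch: shift a feature map and blend it back
    only where shifted foreground lands.

    feat:    (C, H, W) feature map
    fg_mask: (H, W) foreground weights in [0, 1] (assumed to be given)
    dy, dx:  integer shift in feature cells, with |dy| < H and |dx| < W
    alpha:   blend strength for the shifted foreground (assumed)
    """
    _, H, W = feat.shape
    shifted_feat = np.zeros_like(feat)
    shifted_mask = np.zeros_like(fg_mask)

    # Source and destination windows for a zero-padded shift (no wrap-around).
    ys_src = slice(max(0, -dy), min(H, H - dy))
    ys_dst = slice(max(0, dy), min(H, H + dy))
    xs_src = slice(max(0, -dx), min(W, W - dx))
    xs_dst = slice(max(0, dx), min(W, W + dx))

    shifted_feat[:, ys_dst, xs_dst] = feat[:, ys_src, xs_src]
    shifted_mask[ys_dst, xs_dst] = fg_mask[ys_src, xs_src]

    # Per-cell blend weight: nonzero only where shifted foreground arrives.
    w = alpha * shifted_mask
    return (1.0 - w) * feat + w * shifted_feat

# Toy usage: one channel, one foreground cell moved one step to the right.
feat = np.zeros((1, 4, 4))
feat[0, 1, 1] = 2.0
mask = np.zeros((4, 4))
mask[1, 1] = 1.0
mixed = foreground_shift_mix(feat, mask, dy=0, dx=1, alpha=0.5)
```

In this toy case the original cell keeps its value and the neighbouring cell receives a blended copy, so the pair of feature maps `(feat, mixed)` resembles a small synthetic displacement. How the actual method obtains the foreground region and chooses shifts is not stated in the abstract.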