Abstract: Video Individual Counting (VIC), which seeks to count unique
individuals across video sequences without duplication, has broader
applications than traditional Video Crowd Counting (VCC), including
urban planning, event management, and safety monitoring. However,
although current VIC approaches have demonstrated strong
capabilities, their reliance on identity-level or group-level
annotations necessitates substantial labeling effort and expense. To reduce
the high costs of manual annotation, we introduce VIC-SSL, a novel
self-supervised learning approach that utilizes unlabeled data along
with an innovative feature-level augmentation technique called
Foreground-driven ShiftMix (F-ShiftMix). By blending and shifting
in the feature space rather than the image space, F-ShiftMix
generates realistic crowd motion without explicit annotations, while
preserving global semantic coherence. Furthermore, VIC-SSL
integrates the Cost-guided Flow Prompt (CFP) and the
Distinction-aware Cross-Attention (DCA) to enhance flow-aware localization
and inter-frame correspondence learning. Extensive experiments on
three datasets (SenseCrowd, CroHD, and CARLA) demonstrate that
VIC-SSL substantially outperforms existing methods, achieving
state-of-the-art results with significantly reduced data
requirements. These results showcase VIC-SSL’s potential to
dramatically lower annotation costs and improve the deployment
feasibility of VIC systems in complex scenarios. The project
website is available at https://leohuang0511.github.io/vic-ssl.