Pinpointing Crowd in Bird's Eye View via Proximal Contexts

15 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: multi-human BEV localization
Abstract: Bird’s-Eye-View (BEV) perception has emerged as a promising paradigm for scene understanding and has recently been applied to multi-human activity analysis. We formulate multi-human BEV standing localization as a regression task under severe inter-occlusion in crowded scenes. To address this task, we propose a unified dual-space collaborative learning framework, BEVCrowdLocator, that jointly reasons over image and BEV spaces to augment monocular inference. We incorporate standing point queries that explicitly model spatial relationships between adjacent individuals as proximal contexts, disambiguating foot-keypoint localization via proximity-aware attention. Additionally, we present a proximity-aware suppression mechanism that prioritizes consensus predictions during regression, improving robustness to occlusions. We validate our model's performance on social distance measurement using a real-world surveillance benchmark. Moreover, we quantify BEV localization performance on real soccer videos, demonstrating the model's potential in sports analysis.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5855