Abstract: Recent advances in deep learning have led to significant breakthroughs across various fields, including both image and video analysis. However, single-image and single-video approaches are inherently limited to one viewpoint. In particular, surveillance camera analysis often relies on individual cameras, even when multiple cameras cover the same location, which restricts the available information to what a single viewpoint can capture. To address these limitations, we focus on pedestrian detection and tracking in multi-camera settings, which can help mitigate the person occlusion and missed detections that are common in single-view detection and tracking tasks.
In tasks involving multiview inputs, inter-view correlations offer valuable contextual information. To utilize this advantage, we propose a novel approach comprising a consistency loss and a feature aggregation module designed to enhance multiview correlations. First, we introduce a Multi-Pedestrian Consistency Loss to maintain feature coherence for individuals appearing across different views. Second, we propose the Multi-Camera Feature Aggregation module, which captures broader contextual information from surrounding areas using a large receptive field.
Our proposed methods are evaluated on the WildTrack and MultiViewX datasets, where they achieve state-of-the-art performance across key metrics in both detection and tracking tasks.