Unsupervised Multi-view Pedestrian Detection

Published: 20 Jul 2024, Last Modified: 30 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: With the prosperity of intelligent surveillance, multiple cameras have been applied to localize pedestrians more accurately. However, previous methods rely on laborious annotations of pedestrians in every frame and camera view. Therefore, we propose in this paper an Unsupervised Multi-view Pedestrian Detection approach (UMPD) to learn an annotation-free detector via vision-language models and 2D-3D cross-modal mapping: 1) Firstly, Semantic-aware Iterative Segmentation (SIS) is proposed to extract unsupervised representations of multi-view images, which are converted into 2D masks as pseudo labels, via our proposed iterative PCA and zero-shot semantic classes from vision-language models; 2) Secondly, we propose a Geometry-aware Volume-based Detector (GVD) to encode multi-view 2D images end-to-end into a 3D volume and predict voxel-wise density and color via 2D-to-3D geometric projection, trained by 3D-to-2D rendering losses with SIS pseudo labels; 3) Thirdly, for better detection results, i.e., the 3D density projected onto the Bird's-Eye-View (BEV), we propose Vertical-aware BEV Regularization (VBR) to constrain pedestrians to be vertical, matching their natural poses. Extensive experiments on the popular multi-view pedestrian detection benchmarks Wildtrack, Terrace, and MultiviewX show that our proposed UMPD, the first fully unsupervised method to the best of our knowledge, performs competitively with previous state-of-the-art supervised methods. Code is available at https://github.com/lmy98129/UMPD.
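The BEV projection and vertical constraint described in the abstract can be illustrated with a minimal sketch. Note this is a toy illustration under assumed shapes and a simplified regularizer, not the paper's actual GVD or VBR implementation: the density volume, its axis layout, and the variance-based penalty are all assumptions for exposition.

```python
import numpy as np

# Hypothetical voxel-wise density volume of shape (X, Y, Z), values in [0, 1];
# axis 2 is assumed to be the vertical (height) axis.
rng = np.random.default_rng(0)
density = rng.random((4, 4, 8))

# Bird's-Eye-View projection: accumulate voxel density along the vertical axis,
# yielding a 2D occupancy map over the ground plane.
bev = density.sum(axis=2)  # shape (X, Y)

# A toy vertical-aware regularizer: penalizing the variance of each voxel
# column encourages density to spread uniformly along the height axis,
# favoring upright, pole-like shapes consistent with standing pedestrians.
vertical_reg = density.var(axis=2).mean()

print(bev.shape)          # (4, 4)
print(vertical_reg >= 0)  # True
```

In the actual method, such a density volume would be predicted by the detector from multi-view images and supervised through 3D-to-2D rendering losses against the SIS pseudo masks.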
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: On the one hand, to the best of our knowledge, this work is the first to apply a vision-language model to the multi-view pedestrian detection task, obtaining segmentation masks as pseudo labels. On the other hand, to predict the pedestrian bird's-eye-view density as a 3D modality, the multi-view images as a 2D modality are encoded by our proposed detector into a 3D volume via 2D-3D cross-modal mapping. For both training and inference, this work is highly relevant to multiple modalities, including 2D, 3D, and language, realizing unsupervised multi-view pedestrian detection as a novel task.
Supplementary Material: zip
Submission Number: 4673
