High-density Visual Crowd Counting with Perspective Understanding in Deep Neural Networks

Muming Zhao

Published: 01 Jan 2020, Last Modified: 30 Jan 2025, License: CC BY-SA 4.0

Abstract: With population growth and worldwide urbanization, crowd gathering in public places has become more common. Estimating the number of people and measuring their density has therefore become essential for practical applications such as physical security control and public space management. However, the complex environments of crowded scenes pose several challenges for general counting algorithms, among which the scale variation of pedestrians is one of the most significant. With objects of varying size, it is difficult for density-based counting systems to generate density estimates that conform to the scale variations, which usually degrades the counting accuracy significantly. To handle perspective distortion and the related scale-variation problem, traditional methods mainly perform feature normalization for perspective correction. Within the deep learning framework, however, perspective distortion has not been explicitly considered and addressed. Can the mechanism of perspective handling be combined with deep learning for further improvement? In this dissertation, we focus on measuring crowd density through deep architectures with in-network perspective understanding. Three works are presented. First, we develop a depth-embedded network that augments the original features to be scale-aware for more accurate density estimation. The depth map of a scene is encoded, rectified and finally embedded into the network via a proposed depth embedding module. As a result, objects of the same class attain distinct representations in the feature space according to their scales, which directly benefits scale-aware density estimation. We include a comprehensive comparison with various state-of-the-art crowd counting methods to verify the efficacy of incorporating geometric priors.
Second, a multi-stage model with region-based supervision is constructed to obtain robust features with an implicit understanding of the scene geometry. With the internal multi-stage learning mechanism, features can be refined and adjusted repeatedly to perceive the scale variations. In addition, with region-based supervision, the model is further constrained to generate locally consistent densities that conform to object scale variations. Experiments are presented to validate the effectiveness of the proposed model for crowd counting. Third, we build a multi-task framework that drives the network to embed desired semantic, geometric and numeric attributes to handle various types of challenges in crowd counting. With the multi-fold regularization effects introduced by three auxiliary tasks, the intermediate features are driven to convey the desired properties and thus help improve the main task of density estimation. Extensive experiments have been conducted to demonstrate the effectiveness of the proposed method.
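
To make the notion of a density estimate that conforms to scale variation concrete, the following is a minimal sketch of how a ground-truth density map is commonly built in density-based counting: each annotated head contributes a normalized Gaussian, so the map sums to the person count, and the kernel width follows the apparent object size. This is an illustration only, not the dissertation's method; the function names `density_map` and `linear_sigma`, the linear row-to-scale rule, and all numeric values are assumptions introduced here.

```python
import numpy as np


def density_map(points, shape, sigma_for_row):
    """Build a density map from head annotations.

    points: iterable of (row, col) head positions (hypothetical annotations).
    shape: (height, width) of the image.
    sigma_for_row: callable mapping an image row to a Gaussian width.
    """
    height, width = shape
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    density = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        sigma = sigma_for_row(r)
        kernel = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
        # Normalize inside the image so every person contributes exactly 1.
        density += kernel / kernel.sum()
    return density


def linear_sigma(r, near=6.0, far=1.5, height=64):
    # Assumed toy perspective model: people lower in the image (larger row)
    # are closer to the camera and appear larger, so they get a wider kernel.
    return far + (near - far) * r / (height - 1)


heads = [(5, 10), (20, 40), (60, 30)]  # made-up annotations
dmap = density_map(heads, (64, 64), linear_sigma)
estimated_count = dmap.sum()  # integrating the density recovers the count
```

Under these assumptions, a distant person yields a narrow, tall peak and a nearby person a wide, flat one, while each still integrates to one. A counting network trained on such targets has to produce outputs whose spread matches object scale, which is where perspective cues such as a depth map can help.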