Abstract: Dense crowd counting is a challenging problem for which large labeled datasets are difficult to create. Typical crowd images contain thousands of people positioned close to each other, and annotating the location of every person is tedious. Added to this is the growing need to include crowds from as many diverse scenarios as possible for better generalization. In this context, labeling every head across all the settings under consideration does not scale, and the resulting shortage of data directly limits the performance of deep models. We mitigate this issue with a new binary labeling scheme. Instead of annotating every single person in the scene, each image is simply labeled as belonging to either the dense or the sparse crowd category. This dramatically reduces the amount of annotation required, which becomes proportional to the number of images rather than to the crowd count. To train counting models, we create noisy density maps directly from the edge density of the images and then improve them with rectifier networks. Separate rectifier networks are used for the dense and sparse categories, each trained in an unsupervised fashion. The proposed counting model is composed of a self-supervised backbone feature network and a regressor head. The ground truth density maps used to train the regressor are generated from the binary labels and the rectifier networks. Experiments show that the proposed architecture achieves performance competitive with existing models at an extremely low annotation cost.
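To make the idea of a noisy density map derived from edge density more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that edge density can be approximated by the mean gradient magnitude inside non-overlapping square cells of a grayscale image; the function name `edge_density_map`, the `cell` parameter, and the forward-difference gradient are all illustrative assumptions. The rectifier networks and the use of the binary dense/sparse label are not modeled here.

```python
def edge_density_map(image, cell=4):
    """Coarse edge-density map: mean gradient magnitude per cell x cell block.

    `image` is a 2D list of grayscale intensities. This is a hypothetical
    helper for illustration; the abstract does not specify the edge operator.
    """
    h, w = len(image), len(image[0])

    # Gradient magnitude from forward differences; the difference is taken
    # as zero where the right or lower neighbor does not exist.
    grad = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][x + 1] - image[y][x] if x + 1 < w else 0.0
            gy = image[y + 1][x] - image[y][x] if y + 1 < h else 0.0
            grad[y][x] = (gx * gx + gy * gy) ** 0.5

    # Average the gradient magnitude inside each full block; partial blocks
    # at the right and bottom borders are dropped.
    rows, cols = h // cell, w // cell
    density = [[0.0] * cols for _ in range(rows)]
    for by in range(rows):
        for bx in range(cols):
            total = sum(
                grad[by * cell + dy][bx * cell + dx]
                for dy in range(cell)
                for dx in range(cell)
            )
            density[by][bx] = total / (cell * cell)
    return density
```

Under these assumptions, a uniform image yields an all-zero map, while regions with many intensity changes (such as tightly packed heads) yield larger values. Such a map is only a rough proxy for crowd density, which is why the abstract describes it as noisy and in need of rectification.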