Abstract: Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual
images. Consequently, very few take advantage of temporal consistency in video sequences, and those that do impose only weak
smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations
between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to
impose much stronger constraints encoding the conservation of the number of people. As a result, performance improves significantly
without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow
to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner
makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly
reduces the annotation cost while still yielding performance similar to that of the fully supervised case.
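
To make the conservation idea concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes the image is divided into a grid of cells and that, between two consecutive frames, people either stay in their cell or move to one of its eight neighbors. The array layout, the name `densities_from_flows`, and the `OFFSETS` ordering are all hypothetical choices made for illustration. Under these assumptions, the density in a cell at the earlier frame is the sum of the flows leaving that cell, the density at the later frame is the sum of the flows entering it, and the total count is preserved whenever no flow crosses the image border.

```python
import numpy as np

# Hypothetical convention: the 9 destinations reachable from a cell
# (the cell itself plus its 8 grid neighbors), as (row, col) offsets.
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def densities_from_flows(flow):
    """Derive two density maps from one flow tensor.

    flow[i, j, k] is the number of people moving from cell (i, j) to
    cell (i + di, j + dj) between two consecutive frames, where
    (di, dj) = OFFSETS[k]. Flows pointing outside the grid are treated
    as people leaving the scene and are dropped from the later frame.
    """
    h, w, _ = flow.shape
    # Earlier frame: everyone in a cell must go somewhere (or stay).
    density_prev = flow.sum(axis=2)
    # Later frame: accumulate what arrives in each cell.
    density_next = np.zeros((h, w))
    for k, (di, dj) in enumerate(OFFSETS):
        src = flow[max(0, -di):h - max(0, di), max(0, -dj):w - max(0, dj), k]
        density_next[max(0, di):h + min(0, di), max(0, dj):w + min(0, dj)] += src
    return density_prev, density_next


# Toy example: two people stay in cell (0, 0), one moves from (1, 1) to (1, 2).
flow = np.zeros((3, 3, 9))
flow[0, 0, OFFSETS.index((0, 0))] = 2.0
flow[1, 1, OFFSETS.index((0, 1))] = 1.0
prev, nxt = densities_from_flows(flow)
print(prev.sum(), nxt.sum())  # both totals agree because no flow leaves the grid
```

Because both density maps are derived from the same flow tensor, the people count is conserved by construction rather than encouraged by a smoothness penalty, which is the sense in which flow-based constraints are stronger than frame-to-frame smoothing.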