Abstract: This paper presents a novel approach to the task of video-based crowd counting, which can be formalized as the regression problem
of learning a mapping from an input image to an output crowd density
map. Convolutional neural networks (CNNs) have demonstrated striking accuracy gains in a range of computer vision tasks, including crowd
counting. However, the dominant focus within the crowd counting literature has been on the single-frame case or on applying CNNs to videos in a
frame-by-frame fashion without leveraging motion information. This paper proposes a novel architecture that exploits the spatiotemporal information captured in a video stream by combining an optical flow pyramid
with an appearance-based CNN. Extensive empirical evaluation on five
public datasets comparing against numerous state-of-the-art approaches
demonstrates the efficacy of the proposed architecture, with our methods
reporting the best results on all datasets.
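
To make the regression formulation concrete, the following minimal sketch shows how a crowd density map relates to a count. It assumes the common convention in density-based counting that each annotated head contributes a blob of unit mass, so that summing the map gives the number of people; the abstract does not confirm that the paper uses exactly this construction. The function names, the Gaussian kernel, and the `sigma` parameter are hypothetical and chosen only for illustration.

```python
import math


def make_density_map(height, width, heads, sigma=1.5):
    """Build an illustrative ground-truth density map (assumed convention).

    Each (row, col) in `heads` adds a Gaussian blob that is normalized over
    the image grid, so every annotated person contributes a total mass of 1.
    """
    density = [[0.0] * width for _ in range(height)]
    for r0, c0 in heads:
        # Unnormalized Gaussian weights centered on the head position.
        weights = [
            [
                math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2))
                for c in range(width)
            ]
            for r in range(height)
        ]
        total = sum(sum(row) for row in weights)
        # Normalize so this blob sums to 1, then accumulate it into the map.
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density


def count_from_density(density):
    """Recover the crowd count by summing all density values."""
    return sum(sum(row) for row in density)


# Hypothetical head annotations for a 16x16 frame.
heads = [(3, 4), (10, 12), (8, 8)]
density = make_density_map(16, 16, heads)
estimated_count = count_from_density(density)
```

Under this convention, a network trained to regress the density map yields a count estimate simply by summing its output, which is why counting can be treated as an image-to-map regression problem.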