Abstract: Video Individual Counting (VIC), which aims to count the total number of individuals in a video without duplication, is crucial for managing urban public spaces and planning densely populated areas. Existing methods are limited by expensive manual annotation and by the efficiency of localization or detection algorithms. In this work, we contribute a novel Prototype-guided Dual-Transformer Reasoning framework, termed PDTR, which takes both the similarities and the differences between adjacent frames into account to achieve accurate counting in an end-to-end regression manner. Specifically, we first design a multi-receptive field feature fusion module to acquire initial comprehensive representations. Subsequently, the dynamic prototype generation module memorizes consistent representations of similar information to generate prototypes. Additionally, to further extract the shared and private features of different frames, a prototype cross-guided decoder and a privacy-decoupling module are designed. Extensive experiments conducted on two existing VIC datasets consistently demonstrate the superiority of PDTR over state-of-the-art baselines.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: Our work focuses on a multimedia application task: accurate individual counting in videos. Because video individual counting (VIC) requires counting each person in a video only once, it is more complex than frame-by-frame video crowd counting (VCC). The core contributions of this work are as follows: (1) We propose a novel prototype-guided dual-transformer reasoning framework for VIC, which converts the feature matching process of conventional models into an end-to-end regression reasoning procedure. To the best of our knowledge, this is the first attempt to employ Transformers in a dual-stream cross-guidance manner for VIC. (2) A novel dynamic prototype generation module is deployed to bridge and mine consistency information from the comprehensive representations of adjacent frames. It assists the decoder in cross-generating semantically consistent features, thereby exploiting the motion information of targets between frames to reduce duplicate counting. (3) Extensive experiments are conducted on two challenging benchmarks for video individual counting, which demonstrate: (a) that our model compares favorably with other state-of-the-art methods, and (b) the effectiveness of each module through ablation studies.
Submission Number: 3032