A Two-Stream Dynamic Pyramid Representation Model for Video-Based Person Re-Identification

IEEE Trans. Image Process., 2021 (modified: 16 Nov 2022)
Abstract: Video-based person re-identification (Re-ID) leverages the rich spatio-temporal information embedded in sequence data to further improve retrieval accuracy compared with single-image Re-ID. However, it also brings new difficulties: 1) spatial and temporal information must be considered simultaneously; 2) pedestrian video data often contains redundant information; and 3) it suffers from data quality problems such as occlusion and background clutter. To solve these problems, we propose a novel two-stream Dynamic Pyramid Representation Model (DPRM). DPRM mainly consists of three sub-models, i.e., the Pyramidal Distribution Sampling Method (PDSM), Dynamic Pyramid Dilated Convolution (DPDC), and Pyramid Attention Pooling (PAP). PDSM performs more effective data pre-processing according to the semantic distribution of a sequence. DPDC and PAP can be viewed as two streams that describe the motion context and the static appearance of a video sequence, respectively. By fusing the two-stream features, we obtain a comprehensive spatio-temporal representation. Notably, the dynamic pyramid strategy is applied throughout the whole model: it exploits multi-scale features under an attention mechanism to capture the most discriminative features and mitigate the impact of video data quality problems such as partial occlusion. Extensive experiments demonstrate the superior performance of DPRM. For instance, it achieves 83.0% mAP and 89.0% Rank-1 accuracy on the MARS dataset, reaching the state of the art.
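
To make the two-stream idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the abstract gives no code, so the module names (DilatedPyramidBranch, AttentionPoolingBranch, TwoStreamFusion), the dilation rates, and all layer sizes are hypothetical placeholders. It only illustrates the described structure: one branch aggregates motion context with a pyramid of dilated temporal convolutions, the other pools per-frame appearance features with attention, and the two descriptors are fused into a single sequence embedding.

# Minimal sketch under the assumptions stated above; all names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedPyramidBranch(nn.Module):
    """Motion-context stream: a pyramid of temporal dilated convolutions (hypothetical rates)."""
    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations
        )

    def forward(self, x):            # x: (B, T, C) per-frame features
        x = x.transpose(1, 2)        # -> (B, C, T) for temporal convolution
        scales = [F.relu(conv(x)) for conv in self.convs]
        fused = torch.stack(scales, dim=0).mean(dim=0)   # average the pyramid levels
        return fused.mean(dim=-1)    # temporal pooling -> (B, C)


class AttentionPoolingBranch(nn.Module):
    """Appearance stream: attention-weighted pooling over frames."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (B, T, C)
        w = torch.softmax(self.score(x), dim=1)  # frame attention weights (B, T, 1)
        return (w * x).sum(dim=1)                # weighted sum -> (B, C)


class TwoStreamFusion(nn.Module):
    """Fuse motion-context and appearance descriptors into one sequence embedding."""
    def __init__(self, dim):
        super().__init__()
        self.motion = DilatedPyramidBranch(dim)
        self.appearance = AttentionPoolingBranch(dim)
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, x):                        # x: (B, T, C) frame-level features
        f = torch.cat([self.motion(x), self.appearance(x)], dim=-1)
        return self.fc(f)                        # final spatio-temporal representation


if __name__ == "__main__":
    feats = torch.randn(4, 8, 256)               # 4 tracklets, 8 sampled frames, 256-D features
    print(TwoStreamFusion(256)(feats).shape)     # torch.Size([4, 256])

In this sketch the pyramid levels are simply averaged and the streams fused by concatenation plus a linear layer; the paper's dynamic, attention-guided pyramid weighting and the PDSM frame sampling are not reproduced here.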