Learning Spatial-Temporal Graphs with Self-Attention Intensified Conditional Random Field for Video Person Re-identification

Wen-Hsien Fang, Rizard Renanda Adhi Pramono, Yie-Tarng Chen

Published: 2022, Last Modified: 27 Feb 2026, MMSP 2022, CC BY-SA 4.0
Abstract: This paper extends the structural graph pooling scheme to video-based person re-identification (re-ID). A temporal-aware feature extractor first exploits the short-term temporal correlation of fine-grained feature maps to generate a set of multi-scale part-based CNN features. A spatio-temporal graph is then constructed over these multi-scale part-based features, and a structural graph pooling scheme extracts graph features for video re-ID. Specifically, structural graph pooling is formulated as a node clustering problem based on the structural relationships among the multi-scale part features, and is solved by a novel self-attention intensified conditional random field (CRF). Unlike the original structural graph pooling approach, the CRF is integrated with self-attention to combine the strengths of both schemes and capture long-term structural dependencies. This addresses a weakness of existing graph-based video re-ID approaches, namely their difficulty in learning the diverse temporal dependencies of the multi-scale part features. As a result, similar body-part information belonging to the person of interest is aggregated, which diminishes the adverse effect of redundant and background information. Experiments on two benchmark datasets demonstrate the effectiveness of the new method.
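The idea of graph pooling as soft node clustering, with a CRF whose pairwise term comes from self-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `attention_crf_pool`, the projection matrices `w_q`, `w_k`, `w_u` (random here, learned in practice), the mean-field update rule, and the pooling formulas `q.T @ x` and `q.T @ adj @ q` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_crf_pool(x, adj, num_clusters, num_iters=5, seed=0):
    """Hypothetical sketch: cluster graph nodes with a mean-field CRF
    whose pairwise affinities are self-attention weights, then pool.

    x   : (n, d) node features (e.g. multi-scale part features)
    adj : (n, n) adjacency of the spatio-temporal graph
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape

    # Assumed projections; random stand-ins for learned parameters.
    w_q = rng.standard_normal((d, d)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d)) / np.sqrt(d)
    w_u = rng.standard_normal((d, num_clusters)) / np.sqrt(d)

    # Self-attention affinities between every pair of nodes, so that
    # temporally distant parts can still influence each other.
    attn = softmax((x @ w_q) @ (x @ w_k).T / np.sqrt(d), axis=1)

    # Unary term: per-node cluster logits.
    unary = x @ w_u
    q = softmax(unary, axis=1)

    # Mean-field iterations: each node refines its soft assignment
    # using the attention-weighted assignments of the other nodes.
    for _ in range(num_iters):
        message = attn @ q
        q = softmax(unary + message, axis=1)

    # Pool features and adjacency with the soft assignment matrix.
    pooled_x = q.T @ x
    pooled_adj = q.T @ adj @ q
    return q, pooled_x, pooled_adj
```

Calling `attention_crf_pool(x, adj, num_clusters=4)` on a graph with 12 nodes and 8-dimensional features would return a 12×4 soft assignment matrix, a 4×8 pooled feature matrix, and a 4×4 pooled adjacency. Nodes with similar features tend to receive similar assignments, which is the sense in which similar body-part information is aggregated.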