Recursively learning fine-grained spatial-temporal features for video-based person Re-identification

Published: 01 Jan 2025 · Last Modified: 11 Apr 2025 · Eng. Appl. Artif. Intell. 2025 · CC BY-SA 4.0
Abstract: Video-based person Re-identification (Re-ID) is an artificial-intelligence task that aims to match the same person across different cameras. Its core challenge lies in effectively strengthening spatial–temporal feature learning. Existing methods typically pair careful spatial feature enhancement with simple temporal feature aggregation, or vice versa, and may therefore miss fine-grained temporal or spatial clues, especially under occlusion. To address this issue, we propose a recursive spatial–temporal feature learning framework that enhances spatial features and recursively integrates them along the timeline. Specifically, the Trigeminal Attention Fusion (TAF) module performs spatial complementary learning through self-relation and cross-relation attention. It consists of three branches: a Convolutional Neural Network (CNN) Branch, a Keypoint Branch, and a Global Branch. The CNN Branch uses self-relation attention to extract enhanced local features, while the Keypoint Branch employs cross-relation attention to capture pedestrian keypoint-based features and handle occlusion under the guidance of the Global Branch. The Temporal Attention Alignment (TAA) module is then designed to recursively propagate temporal information between adjacent frames. Furthermore, we design a bottom-up and top-down training strategy to improve the feature learning ability of the model: it mines high-quality video-level features from frame-level features via bottom-up inference and refines frame-level features under the top-down guidance of video-level semantic feedback. Extensive experiments on four public Re-ID benchmarks demonstrate that our framework outperforms several state-of-the-art methods and remains effective even under occlusion.
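The abstract gives no implementation details, but the two attention steps it names can be illustrated in PyTorch. The sketch below is a minimal, hypothetical reading of cross-relation attention (one branch's tokens attending to another branch's tokens) and of the recursive frame-by-frame propagation attributed to the TAA module; all class names, token counts, and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class CrossRelationAttention(nn.Module):
    """Hypothetical sketch: tokens from one branch (e.g. keypoints) attend to
    tokens from another branch (e.g. global features) so occluded parts can
    borrow complementary context."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens: torch.Tensor, guide_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens: (B, N, D); guide_tokens: (B, M, D)
        out, _ = self.attn(query_tokens, guide_tokens, guide_tokens)
        return self.norm(query_tokens + out)  # residual connection + norm


class RecursiveTemporalAlignment(nn.Module):
    """Hypothetical TAA-style recursion: a running clip-level state is updated
    frame by frame, so temporal cues propagate between adjacent frames."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) frame-level features from the spatial stage
        state = frames[:, 0]                   # initialise with frame 0
        for t in range(1, frames.size(1)):
            q = frames[:, t].unsqueeze(1)      # current frame as query
            m = state.unsqueeze(1)             # propagated state as memory
            aligned, _ = self.attn(q, m, m)
            state = self.norm(state + aligned.squeeze(1))
        return state                           # video-level feature


# Usage sketch: fuse 17 keypoint tokens with 16 global tokens per frame,
# average into a frame descriptor, then pool an 8-frame clip recursively.
B, T, D = 2, 8, 512
kp = torch.randn(B * T, 17, D)
gl = torch.randn(B * T, 16, D)
frame_feats = CrossRelationAttention(D)(kp, gl).mean(dim=1).view(B, T, D)
video_feat = RecursiveTemporalAlignment(D)(frame_feats)  # shape (B, D)
```

Note that the recursion above keeps a single running state rather than attending over the whole clip at once, which is one plausible way to propagate information only between adjacent frames as the abstract describes.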