Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Yujian Feng, Feng Chen, Jian Yu, Yimu Ji, Fei Wu, Tianliang Liu, Shangdong Liu, Xiao-Yuan Jing, Jiebo Luo

Published: 01 Jan 2024, Last Modified: 13 Mar 2026, IEEE Transactions on Multimedia, CC BY-SA 4.0

Abstract: Video-based visible-infrared person re-identification (VVI-ReID) aims to match the identity of a person captured in video sequences from both visible and infrared cameras. The VVI-ReID task requires considering both the spatial relationship between body parts within each frame and the temporal change of appearance between successive frames. Existing VVI-ReID methods employ Convolutional Neural Networks to extract local spatial features and Long Short-Term Memory to form temporal associations. However, these methods cannot effectively capture global spatial features or the long-range temporal dependencies in ultra-long sequences. In this paper, we propose a Cross-modality Spatial-temporal Transformer (CST), comprising a Cross-frame Tube Transformer Module (CTTM) and a Multi-frame Transformer Fusion Module (MTFM), to address these challenges. First, CTTM tokenizes a video clip into multiple 3D tubes, each encapsulating local spatial-temporal information of pedestrians, and then obtains global spatial-temporal representations by establishing the relationship between tubes. Second, we design MTFM to exchange information between multiple frames using message tokens, thus modeling the long-range temporal dependencies of pedestrian features. In addition, to prevent the potential representation collapse caused by triplet-based loss functions, we propose a diversity-consistency (DC) loss function that preserves the diversity and consistency of cross-modality feature representations by imposing variance, invariance, and covariance constraints on the feature representations. Extensive benchmark experiments demonstrate that our approach outperforms state-of-the-art methods by large margins.
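
Illustrative sketch (not from the paper): the tube tokenization step described in the abstract can be pictured as cutting a clip into small spatial-temporal blocks and flattening each block into one token. The NumPy snippet below is a minimal sketch under stated assumptions: the function name `tubes_from_clip`, the non-overlapping layout, the clip size (6 frames of 256x128 RGB), and the tube size (2x16x16) are all hypothetical choices for illustration. The actual CTTM may use different tube sizes, overlapping tubes, or a learned projection.

```python
import numpy as np


def tubes_from_clip(clip, tube_t, tube_h, tube_w):
    """Split a clip of shape (T, H, W, C) into non-overlapping 3D tubes.

    Returns an array of shape (num_tubes, tube_t * tube_h * tube_w * C),
    one flattened token per tube. Hypothetical helper, not the paper's code.
    """
    T, H, W, C = clip.shape
    if T % tube_t or H % tube_h or W % tube_w:
        raise ValueError("clip dimensions must be divisible by the tube size")
    nt, nh, nw = T // tube_t, H // tube_h, W // tube_w
    # Separate each axis into (number of tubes, size within a tube).
    x = clip.reshape(nt, tube_t, nh, tube_h, nw, tube_w, C)
    # Bring the tube indices to the front: (nt, nh, nw, tube_t, tube_h, tube_w, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten the tube grid into a token axis and each tube into a vector.
    return x.reshape(nt * nh * nw, tube_t * tube_h * tube_w * C)


# Assumed example input: 6 frames, 256x128 pixels, 3 channels.
clip = np.random.rand(6, 256, 128, 3)
tokens = tubes_from_clip(clip, tube_t=2, tube_h=16, tube_w=16)
print(tokens.shape)  # (384, 1536): 3*16*8 tubes, each 2*16*16*3 values
```

In a transformer pipeline, each row of `tokens` would typically be passed through a linear projection before self-attention relates the tubes to one another; that projection is omitted here.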

External IDs: doi:10.1109/tmm.2024.3354575