Adversarial self-attack defense and spatial-temporal relation mining for visible-infrared video person re-identification

Le Xu, Huafeng Li, Yafei Zhang, Dapeng Tao

Published: 2025, Last Modified: 26 Feb 2026, Int. J. Mach. Learn. Cybern. 2025, CC BY-SA 4.0

Abstract: In visible-infrared video person re-identification (re-ID), the key to solving cross-modal pedestrian identity matching lies in extracting features that are robust to complex scene variations, including modality, camera viewpoints, pedestrian pose, and background, as well as effectively mining motion information. To this end, the paper proposes a new visible-infrared video person re-ID method from a novel perspective, i.e., adversarial self-attack defense and spatial-temporal relation mining. In this work, variations in viewpoints, posture, background and modality discrepancies are considered as the primary factors that cause the perturbations of person identity features. The interference information arising from these factors, inherent in the training samples, is treated as adversarial perturbation. During the training process, this perturbation is used to perform adversarial attacks on the re-ID model, thereby enhancing the model’s robustness to these challenging factors. Notably, the adversarial attack is achieved by activating the interference information in the input samples, without generating explicit adversarial samples, and is thus referred to as adversarial self-attack. Furthermore, we propose an adversarial defense strategy that strengthens the network’s resistance to attacks by enhancing the discriminability of pedestrian features and incorporating adversarial training. This design effectively integrates both adversarial attack and defense within a unified framework. Additionally, we present a spatial-temporal information-guided feature representation network that effectively utilizes the information embedded in video sequences. The network not only extracts features from the video-frame sequences but also leverages spatial relationships within local information to guide the extraction of more robust features. The proposed method exhibits compelling performance on large-scale cross-modality video datasets. 
The source code of our method is available at https://github.com/lhf12278/SADSTRM.

External IDs: dblp:journals/mlc/XuLZT25a