A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

Published: 01 Jan 2024 · Last Modified: 19 May 2025 · IEEE Trans. Intell. Transp. Syst. 2024 · CC BY-SA 4.0
Abstract: Video-based person Re-Identification (Re-ID) is an active research topic in intelligent transportation systems; it aims to retrieve video sequences of the same person across non-overlapping surveillance cameras. Compared with static images, video sequences contain richer visual information from multiple views, such as the spatial and temporal views. However, previous Re-ID methods usually focus on a single, limited view and lack diverse observations from different views. To capture richer perceptions and extract more comprehensive representations, we propose a novel learning framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we first design a View-wise Projector (VP) to jointly transform raw videos into spatial, temporal, and spatial-temporal views. In addition, inspired by the great success of Vision Transformers (ViT), we introduce the Transformer structure for information enhancement and aggregation. In our work, three Self-view Transformers (ST) exploit the relationships among local features for information enhancement in the spatial, temporal, and spatial-temporal views. Moreover, a Cross-view Transformer (CT) aggregates the multi-view features into comprehensive representations. Experimental results indicate that our approach outperforms state-of-the-art approaches on four public Re-ID benchmarks.
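To make the three-view pipeline concrete, the following is a minimal NumPy sketch of the data flow the abstract describes: a view-wise projection into spatial, temporal, and spatial-temporal token sets, per-view self-attention, and a cross-view fusion step. All shapes, the pooling choices, and the unparameterized single-head attention are illustrative assumptions, not the paper's actual layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Learned Q/K/V projections are omitted for brevity (assumption).
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

# Hypothetical sizes: T frames, HW spatial positions, d-dim features.
T, HW, d = 4, 6, 8
rng = np.random.default_rng(0)
video = rng.standard_normal((T, HW, d))

# View-wise Projector (VP): reshape one video into three token sets.
spatial_tokens = video.mean(axis=0)       # (HW, d): pooled over time
temporal_tokens = video.mean(axis=1)      # (T, d): pooled over space
st_tokens = video.reshape(T * HW, d)      # (T*HW, d): joint view

# Self-view Transformers (ST): enhance local features within each view.
views = [self_attention(v)
         for v in (spatial_tokens, temporal_tokens, st_tokens)]

# Cross-view Transformer (CT): pool each view, then attend across views.
pooled = np.stack([v.mean(axis=0) for v in views])  # (3, d)
fused = self_attention(pooled).mean(axis=0)         # final (d,) embedding
print(fused.shape)
```

In this sketch the fused vector would serve as the sequence-level descriptor that is matched across cameras; in the actual model each stage is a learned Transformer block rather than a parameter-free attention pass.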