PeT-KeyStAtion: Parameter-efficient Transformer with Keypoint-guided Spatial-temporal Aggregation for Video-based Person Re-identification
Abstract: Video-based Person Re-identification (ReID) is crucial in visual surveillance: it matches video snippets of individuals across multiple non-overlapping cameras. Existing methods either perform ReID at the image level, discarding temporal information, or rely on complex temporal aggregation techniques that inflate network size and reduce efficiency. Recent Vision Transformer (ViT) architectures, pre-trained on diverse large-scale datasets, offer strong fine-grained feature discrimination. To fully exploit ViT architectures for video-based ReID without adding substantial extra modules, we propose PeT-KeyStAtion: a Parameter-efficient Transformer with Keypoint-guided Spatial-temporal Aggregation, built on a Spatial-Temporal and Keypoint (STK) module with lightweight adapters. Our framework effectively captures and aggregates spatial, temporal, and keypoint information while training only 11% of the parameters required by full fine-tuning. Extensive experiments show that our method outperforms state-of-the-art baselines on MARS and iLIDS-VID and achieves promising performance on LS-VID.
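The abstract's "lightweight adapters" refer to the standard parameter-efficient fine-tuning idea: small bottleneck modules inserted into a frozen backbone so that only a small fraction of parameters is trained. The following is a minimal NumPy sketch of a generic bottleneck adapter with a residual connection; all names and dimensions here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

class Adapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project,
    residual add. The up-projection is zero-initialized so the adapter
    behaves as an identity map at the start of training."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.w_up = np.zeros((bottleneck, dim))  # identity at init

    def __call__(self, x):
        # output = x + up(relu(down(x)))
        h = np.maximum(x @ self.w_down, 0.0)
        return x + h @ self.w_up

adapter = Adapter(dim=768, bottleneck=64)
x = np.ones((1, 768))
out = adapter(x)
print(np.allclose(out, x))  # True: identity at initialization

# Trainable-parameter fraction vs. one full dim x dim projection
# (illustrative arithmetic only; the paper's 11% figure covers the
# whole model, not a single layer).
frac = (2 * 768 * 64) / (768 * 768)
print(round(frac, 3))  # 0.167
```

Because only `w_down` and `w_up` would be updated while the backbone stays frozen, the trainable-parameter count scales with the bottleneck width rather than the full hidden dimension, which is how adapter-based methods reach low fractions such as the 11% reported here.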