GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Published: 20 Jul 2024, Last Modified: 01 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Gait recognition has attracted increasing attention from academia and industry as a technology for recognizing humans at a distance, non-intrusively and without requiring cooperation. Although advanced methods have achieved impressive success in laboratory scenarios, most of them perform poorly in the wild. Recently, some Convolutional Neural Network (ConvNet) based methods have been proposed to address gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. Directly replacing convolution blocks with vision transformer blocks, on the other hand, may weaken the local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field; the module mainly consists of Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computational complexity than multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field; it also aggregates the pseudo global temporal receptive field into a truly holistic one. Furthermore, we propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce intra-class distance and expand the set of positive samples during training. Extensive experiments show that our method obtains state-of-the-art results on the in-the-wild datasets Gait3D and GREW. The code is available at https://github.com/bgdpgz/GLGait.
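
To make the GLTM design concrete, below is a minimal PyTorch sketch. The abstract does not say how PGTA reduces the memory and computation of MHSA; this sketch assumes it attends to temporally pooled keys and values, one common way to obtain a cheaper, pseudo-global attention. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch of the GLTM idea from the abstract: a cheap
# "pseudo global" temporal attention followed by a temporal convolution.
# ASSUMPTION: PGTA saves cost by attending to temporally pooled keys/values;
# the paper's actual mechanism may differ.

import torch
import torch.nn as nn


class PseudoGlobalTemporalAttention(nn.Module):
    """Attention over time with pooled keys/values (assumed mechanism)."""

    def __init__(self, dim: int, pool: int = 4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Downsample the temporal axis before computing keys/values, so the
        # attention map is (T x T/pool) instead of (T x T).
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool, ceil_mode=True)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) -- per-frame gait features
        q = self.q(x)                                       # (B, T, C)
        pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, T', C)
        k, v = self.kv(pooled).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, T, T')
        return attn.softmax(dim=-1) @ v                     # (B, T, C)


class GLTM(nn.Module):
    """Pseudo-global attention + local temporal convolution (sketch)."""

    def __init__(self, dim: int, kernel: int = 3):
        super().__init__()
        self.pgta = PseudoGlobalTemporalAttention(dim)
        # Temporal convolution: enhances the local receptive field and mixes
        # the pseudo-global context into a holistic temporal representation.
        self.tconv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.pgta(x)                                    # global
        x = x + self.tconv(x.transpose(1, 2)).transpose(1, 2)   # local
        return x
```

Likewise, the abstract describes CTL only at a high level (reduce intra-class distance, expand the positive set). One plausible reading, sketched below, augments a batch-hard triplet loss with per-class feature centers treated as extra positives; the function name and exact formulation are assumptions, and the official loss is defined in the paper.

```python
# Hedged sketch of a "center-augmented" triplet loss consistent with the
# abstract: batch class centers are added as extra positive samples so each
# anchor is also pulled toward its class center. Illustrative only.

import torch
import torch.nn.functional as F


def center_augmented_triplet(feats: torch.Tensor,
                             labels: torch.Tensor,
                             margin: float = 0.2) -> torch.Tensor:
    """feats: (N, D) embeddings; labels: (N,) identity labels."""
    # Per-class batch centers act as additional positives (assumption).
    classes = labels.unique()
    centers = torch.stack([feats[labels == c].mean(0) for c in classes])
    feats_aug = torch.cat([feats, centers], dim=0)
    labels_aug = torch.cat([labels, classes], dim=0)

    dist = torch.cdist(feats_aug, feats_aug)            # pairwise distances
    same = labels_aug[:, None] == labels_aug[None, :]
    eye = torch.eye(len(labels_aug), dtype=torch.bool, device=feats.device)

    # Batch-hard mining: hardest positive and hardest negative per anchor.
    pos = dist.masked_fill(~same | eye, float('-inf')).amax(1)
    neg = dist.masked_fill(same, float('inf')).amin(1)
    return F.relu(pos - neg + margin).mean()
```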
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Media Interpretation
Relevance To Conference: Multimedia: In multimedia, videos are a common medium for information acquisition. Our method encodes videos to identify pedestrians by their gait across different videos, thereby facilitating information retrieval and enriching the user experience. Furthermore, our approach is tailored to gait recognition in wild scenarios, offering a global temporal receptive field and strong generalizability. It is unaffected by variations in video length, making it well suited to the complex network environments typical of multimedia applications. Multimodal: Gait encompasses multiple modalities, including silhouettes, semantic human parsing, and skeletons. Our method uses silhouette sequences as the input modality and offers a novel approach to processing them, enabling better feature representations for this modality and facilitating future multimodal fusion. We believe that employing our method for the silhouette modality can enhance the performance of multimodal gait recognition.
Supplementary Material: zip
Submission Number: 461