Language as a Bridge: Semantic-Guided Cross-Modal Gait Recognition via Text Prototype and Feature Decoupling

Zhiyang Lu, Wankang Zeng, Ming Cheng, Cheng Wang

Published: 01 Jan 2026, Last Modified: 06 May 2026, IEEE Transactions on Information Forensics and Security, CC BY-SA 4.0

Abstract: Gait recognition aims to identify individuals based on walking patterns in a long-range, contactless manner. While camera-based methods have advanced significantly, their performance deteriorates under poor lighting conditions. LiDAR offers a promising alternative by capturing accurate 3D gait information regardless of illumination. However, effectively integrating heterogeneous data from diverse sensors, such as LiDAR and cameras, remains a key challenge for cross-modal gait recognition. Existing approaches often minimize modality discrepancy directly, which can cause class collapse and damage inter-class discriminability. To overcome these limitations, we propose a Semantic-Guided Cross-modal Gait recognition framework, SG-CrossGait, that introduces text features as the prototype space to bridge camera and LiDAR modalities. We design structured Gait Description Factors (GDF) and leverage multimodal large language models (MLLMs) for automatic factor annotation and text generation, enriching existing datasets with textual descriptions to yield SUSTech1K-Text and FreeGait-Text. A CLIP-based pipeline aligns multi-grained representations from both modalities to the text prototype space. We further propose the Dual-stream Cross-attention Fusion (DCF) module for fine-grained feature integration and the Semantic-Guided Feature Decoupling (SGFD) module to disentangle shared and modality-specific features. A Multi-task Training (MT) scheme incorporating Gait Attribute Recognition (GAR) further enhances intra-class compactness. Extensive experiments validate the effectiveness of our approach. On SUSTech1K-Text, our method achieves 61% accuracy in LiDAR-to-Camera recognition, outperforming the state-of-the-art method by 8.3%. We also release the Gait-Text benchmark to promote future research at the intersection of gait analysis and vision-language learning. Code and datasets are available at: https://github.com/O-VIGIA/SCCG.git
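
The idea of using text features as a prototype space can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the three-dimensional toy embeddings, and the CLIP-style temperature of 0.07 are all assumptions made for illustration. Each modality's feature is scored against per-identity text prototypes by cosine similarity and trained with a cross-entropy loss, so camera and LiDAR features are pulled toward a shared text anchor rather than directly toward each other.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(feature, prototypes, temperature=0.07):
    """Temperature-scaled cosine similarity to every text prototype.
    The temperature value is an assumed CLIP-style default."""
    f = l2_normalize(feature)
    return [
        sum(a * b for a, b in zip(f, l2_normalize(p))) / temperature
        for p in prototypes
    ]

def prototype_loss(feature, prototypes, label, temperature=0.07):
    """Cross-entropy of one feature against the text prototypes."""
    logits = cosine_logits(feature, prototypes, temperature)
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def predict(feature, prototypes):
    """Index of the most similar text prototype."""
    logits = cosine_logits(feature, prototypes)
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical toy data: one text prototype per identity.
text_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
camera_feature = [0.9, 0.1, 0.0]  # identity 0 seen by a camera
lidar_feature = [0.8, 0.0, 0.2]   # identity 0 seen by LiDAR

# Both modalities are aligned to the same text anchor.
total_loss = (
    prototype_loss(camera_feature, text_prototypes, 0)
    + prototype_loss(lidar_feature, text_prototypes, 0)
)
```

In this sketch neither modality's loss depends on the other modality's feature; the two streams are related only through the shared prototypes, which is the intuition behind avoiding direct discrepancy minimization.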

External IDs: doi:10.1109/tifs.2026.3675459