Multimodal Mutual Learning for Unsupervised Gait Recognition

Shaopeng Yang, Saihui Hou, Xu Liu, Chunshui Cao, Kang Ma, Yongzhen Huang

Published: 01 Jan 2025, Last Modified: 07 Jan 2026 · IEEE Transactions on Information Forensics and Security · CC BY-SA 4.0
Abstract: The primary challenge in unsupervised gait recognition lies in generating meaningful and diverse supervisory signals to guide representation learning, and the effectiveness of such methods largely depends on the richness of those signals. Unlike previous methods that construct supervisory signals from a single modality, we propose a novel framework, Multimodal Mutual Learning (M3L), which leverages the identity consistency and complementary nature of the silhouette and skeleton modalities to generate richer and more informative supervisory signals. To fully exploit these richer supervisory signals, M3L encourages mutual prediction between the silhouette and skeleton modalities, guiding the network toward modality-invariant representations. However, mutual prediction alone is hindered by the inherent modality gap, so we introduce a Multimodal Collaborative Module to explicitly bridge this gap and promote cross-modal knowledge transfer. Moreover, to keep the framework practical when only one modality is available at inference, we introduce a Multimodal Disentanglement Module, which decouples the two branches and distills a shared representation, preserving the gains of multimodal training while maintaining robust performance under single-modality conditions. Extensive experiments on four widely used gait datasets—Gait3D, GREW, CASIA-B, and SUSTech1K—demonstrate the effectiveness of our approach and highlight its potential to advance unsupervised gait recognition.
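
The abstract outlines a two-branch design: silhouette and skeleton encoders trained to predict each other's representations, with a shared representation distilled so that a single modality suffices at inference. The paper's implementation details are not given here; the following is a minimal conceptual sketch, assuming simple MLP encoders, cosine-similarity mutual-prediction losses, and an averaged-embedding distillation target (all module names, dimensions, and loss terms are illustrative, not the authors' actual design).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Toy per-modality encoder (stands in for the real silhouette/skeleton backbones)."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        return self.net(x)


class M3LSketch(nn.Module):
    """Conceptual two-branch model: cross-modal mutual prediction plus
    distillation of a shared representation usable with one modality at test time."""
    def __init__(self, sil_dim, ske_dim, emb_dim=256):
        super().__init__()
        self.sil_enc = Encoder(sil_dim, emb_dim)   # silhouette branch
        self.ske_enc = Encoder(ske_dim, emb_dim)   # skeleton branch
        # Cross-modal predictors: each branch tries to predict the other's embedding.
        self.sil_to_ske = nn.Linear(emb_dim, emb_dim)
        self.ske_to_sil = nn.Linear(emb_dim, emb_dim)
        # Shared head distilled from both branches.
        self.shared_head = nn.Linear(emb_dim, emb_dim)

    def forward(self, sil, ske):
        z_sil = self.sil_enc(sil)
        z_ske = self.ske_enc(ske)
        # Mutual prediction: cosine distance between predicted and (detached) target embeddings.
        loss_mutual = (
            1 - F.cosine_similarity(self.sil_to_ske(z_sil), z_ske.detach(), dim=-1).mean()
        ) + (
            1 - F.cosine_similarity(self.ske_to_sil(z_ske), z_sil.detach(), dim=-1).mean()
        )
        # Distillation toward a modality-shared target (here: the averaged embedding).
        target = (z_sil + z_ske).detach() / 2
        loss_distill = F.mse_loss(self.shared_head(z_sil), target) + \
                       F.mse_loss(self.shared_head(z_ske), target)
        return loss_mutual + loss_distill


# Usage: random tensors stand in for per-sequence silhouette / skeleton features.
model = M3LSketch(sil_dim=128, ske_dim=64)
sil_feat, ske_feat = torch.randn(8, 128), torch.randn(8, 64)
loss = model(sil_feat, ske_feat)
loss.backward()
```

At inference under single-modality conditions, only one encoder plus the shared head would be used, which is the role the abstract assigns to the Multimodal Disentanglement Module.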