Abstract: The primary challenge in unsupervised gait recognition lies in generating meaningful and diverse supervisory signals to guide representation learning, and the effectiveness of such methods largely depends on how rich those signals are. Unlike previous methods that construct supervisory signals from a single modality, we propose a novel framework, named Multimodal Mutual Learning (M3L), that exploits the identity consistency and complementary nature of the silhouette and skeleton modalities to generate richer and more informative supervisory signals. To fully exploit these signals, M3L encourages mutual prediction between the silhouette and skeleton modalities, guiding the network toward modality-invariant representations. Because mutual prediction alone is hindered by the inherent modality gap, we introduce a Multimodal Collaborative Module that explicitly bridges this gap and promotes cross-modal knowledge transfer. Moreover, to keep the framework practical when only one modality is available at inference, we introduce a Multimodal Disentanglement Module, which decouples the two branches and distills a shared representation, preserving the gains of multimodal training while maintaining robust performance under single-modality conditions. Extensive experiments on four widely used gait datasets (Gait3D, GREW, CASIA-B, and SUSTech1K) demonstrate the effectiveness of our approach and highlight its potential to advance unsupervised gait recognition.
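To make the mutual-prediction idea concrete, below is a minimal sketch, not the authors' implementation, of a cross-modal prediction objective between silhouette and skeleton embeddings. The predictor MLPs, feature dimension, and stop-gradient targets are illustrative assumptions; the actual M3L modules are described in the paper.

```python
# Hypothetical sketch of a mutual-prediction objective between two gait
# modalities. Assumes each branch already yields one feature vector per
# sequence; all layer sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualPredictionLoss(nn.Module):
    """Each modality predicts the other's (stop-gradient) embedding."""
    def __init__(self, dim=256, hidden=512):
        super().__init__()
        # Small MLP predictors standing in for the cross-modal heads.
        self.sil_to_ske = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.ske_to_sil = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z_sil, z_ske):
        # Negative cosine similarity with a detached target, so each
        # modality acts as a supervisory signal for the other.
        p_sil = F.normalize(self.sil_to_ske(z_sil), dim=-1)
        p_ske = F.normalize(self.ske_to_sil(z_ske), dim=-1)
        t_sil = F.normalize(z_sil, dim=-1).detach()
        t_ske = F.normalize(z_ske, dim=-1).detach()
        loss = -(p_sil * t_ske).sum(-1).mean() - (p_ske * t_sil).sum(-1).mean()
        return 0.5 * loss

# Usage with dummy embeddings from the two branches:
z_sil = torch.randn(8, 256)   # silhouette-branch features
z_ske = torch.randn(8, 256)   # skeleton-branch features
loss = MutualPredictionLoss()(z_sil, z_ske)
```

The stop-gradient on the target branch is a common choice in self-supervised mutual-prediction setups; whether M3L uses this exact mechanism is an assumption here.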
DOI: 10.1109/TIFS.2025.3602638