When One Sense Fails: Towards a Multi-Modal Gait Recognition Framework Bridging Vision and Structural Vibration Sensing
Keywords: Gait recognition, Person identification, Vision, Structural vibration, Multi-modal framework
TL;DR: A person identification system that combines video silhouettes (how a person walks) with floor vibrations (the physical impact of their footsteps) to overcome the limitations of video-only tracking.
Abstract: Gait recognition from video silhouettes has seen significant progress, yet occlusions and changes in appearance continue to limit its reliability. Structural vibrations induced by footsteps offer a complementary signal that is inherently privacy-preserving, but this modality still lacks the benchmarks and principled fusion strategies necessary for real-world use. In this work, we introduce a multi-modal framework that combines silhouette sequences with floor-vibration measurements for person identification. Our fusion architecture employs intra-modal self-attention to refine each representation independently, bidirectional cross-modal contextualization to exchange information between the two streams, and a learned gating mechanism that adaptively weights each modality's contribution. We evaluate the approach under four experimental protocols and compare it against several alternative fusion strategies. The proposed model achieves approximately 89% rank-1 identification accuracy. Further analysis shows that vibration features provide view-invariant cues that complement the appearance information captured by silhouettes, accounting for much of the gain over either modality alone. To encourage reproducible follow-up work, we publicly release our source code, trained models, and evaluation protocols.
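The fusion described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' released implementation: the module names, embedding dimension, pooling choice, and gate design are all assumptions made for clarity.

```python
# Hypothetical sketch of the described fusion: intra-modal self-attention,
# bidirectional cross-modal attention, and a learned gate. Dimensions and
# structure are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Intra-modal self-attention refines each stream independently.
        self.self_sil = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_vib = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Bidirectional cross-modal attention exchanges context between streams.
        self.cross_sil = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_vib = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Learned gate adaptively weights each modality's contribution.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, sil, vib):
        # sil, vib: (batch, time, dim) token sequences from each encoder.
        sil = sil + self.self_sil(sil, sil, sil)[0]
        vib = vib + self.self_vib(vib, vib, vib)[0]
        sil2 = sil + self.cross_sil(sil, vib, vib)[0]  # silhouettes attend to vibration
        vib2 = vib + self.cross_vib(vib, sil, sil)[0]  # vibration attends to silhouettes
        # Pool each stream over time, then gate the two pooled embeddings.
        s, v = sil2.mean(dim=1), vib2.mean(dim=1)
        g = self.gate(torch.cat([s, v], dim=-1))  # per-dimension weight in (0, 1)
        return g * s + (1 - g) * v                # fused identity embedding

# The two streams may have different temporal lengths (8 vs. 6 frames here).
fused = GatedCrossModalFusion()(torch.randn(2, 8, 256), torch.randn(2, 6, 256))
print(fused.shape)  # torch.Size([2, 256])
```

The sigmoid gate lets the model lean on vibration features when silhouettes are occluded and vice versa, which is one plausible reading of the adaptive weighting the abstract claims.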
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 14