Keywords: human activity recognition, cross-modal transfer, multimodal learning, IMU, RGB videos
TL;DR: This work proposes C3T, a method to transfer knowledge across modalities through temporal latent space alignment for IMU-based HAR.
Abstract: Cross-modal knowledge transfer between time-series sensors remains a critical challenge for robust Human Activity Recognition (HAR) systems.
Effective cross-modal transfer exploits knowledge from one modality to train models for a completely unlabeled target modality—a problem setting we refer to as Unsupervised Modality Adaptation (UMA).
Existing methods typically compress continuous-time data samples into single latent vectors during alignment, limiting their ability to transfer temporal information through real-world temporal distortions.
To address this, we introduce Cross-modal Transfer Through Time (C3T), which preserves fine-grained temporal information during alignment to handle dynamic sensor data better.
C3T achieves this by aligning a set of temporal latent vectors across sensing modalities.
Our extensive experiments on various camera+IMU datasets demonstrate that C3T outperforms existing methods in UMA by over 8% in accuracy and shows superior robustness to temporal distortions such as time-shift, misalignment, and dilation.
Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for various multimodal applications.
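The abstract's central idea, aligning a set of per-time-step latent vectors across modalities rather than a single pooled vector, could look roughly like the following minimal sketch. The encoder names, tensor shapes, and cosine-alignment loss are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' code): per-time-step latent alignment
# between two modality encoders, instead of pooling each sequence into a
# single vector before alignment. Names and loss choice are assumptions.
import torch
import torch.nn.functional as F

def temporal_alignment_loss(z_video: torch.Tensor, z_imu: torch.Tensor) -> torch.Tensor:
    """Align latent sequences of shape (batch, time, dim) step by step.

    Each time step keeps its own latent vector, so temporal structure
    survives the alignment rather than being collapsed by mean-pooling.
    """
    z_video = F.normalize(z_video, dim=-1)  # unit-norm latents per time step
    z_imu = F.normalize(z_imu, dim=-1)
    # Cosine distance averaged over batch and time steps.
    return (1.0 - (z_video * z_imu).sum(dim=-1)).mean()

# Usage (hypothetical): z_v = video_encoder(frames), z_i = imu_encoder(signals),
# both (B, T, D); minimize temporal_alignment_loss(z_v, z_i) alongside a HAR
# classifier trained only on the labeled source modality (the UMA setting).
```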
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21630