Cross-modal Transfer Through Time for Human Activity Recognition

ICLR 2026 Conference Submission 21630 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: human activity recognition, cross-modal transfer, multimodal learning, IMU, RGB videos
TL;DR: This work proposes C3T, a method that transfers knowledge across modalities through temporal latent space alignment for IMU-based HAR.
Abstract: Cross-modal knowledge transfer between time-series sensors remains a critical challenge for robust Human Activity Recognition (HAR) systems. Effective cross-modal transfer exploits knowledge from one modality to train models for a completely unlabeled target modality—a problem setting we refer to as Unsupervised Modality Adaptation (UMA). Existing methods typically compress continuous-time data samples into single latent vectors during alignment, limiting their ability to transfer temporal information through real-world temporal distortions. To address this, we introduce Cross-modal Transfer Through Time (C3T), which preserves fine-grained temporal information during alignment to better handle dynamic sensor data. C3T achieves this by aligning a set of temporal latent vectors across sensing modalities. Our extensive experiments on various camera+IMU datasets demonstrate that C3T outperforms existing methods in UMA by over 8% in accuracy and shows superior robustness to temporal distortions such as time-shift, misalignment, and dilation. Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for various multimodal applications.
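The abstract's core idea is to align a sequence of temporal latent vectors across modalities rather than a single pooled embedding per sample. The sketch below illustrates one plausible way to realize this, assuming a symmetric InfoNCE-style contrastive objective applied per time step; the function name, the (B, T, D) latent layout, and the temperature value are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of per-time-step latent alignment (hypothetical; the
# paper's exact objective and architecture may differ). Two encoders are
# assumed to map an RGB clip and an IMU window to sequences of T latent
# vectors each, which are aligned time step by time step instead of
# being pooled into one vector before alignment.
import torch
import torch.nn.functional as F

def temporal_alignment_loss(z_video, z_imu, temperature=0.07):
    """z_video, z_imu: (B, T, D) temporal latents from each modality."""
    B, T, D = z_video.shape
    # Normalize so dot products are cosine similarities.
    zv = F.normalize(z_video.reshape(B * T, D), dim=-1)
    zi = F.normalize(z_imu.reshape(B * T, D), dim=-1)
    # Similarity between every (sample, time-step) latent pair.
    logits = zv @ zi.t() / temperature            # (B*T, B*T)
    # Positive pairs: same sample AND same time step.
    targets = torch.arange(B * T, device=logits.device)
    # Symmetric InfoNCE over video->IMU and IMU->video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the positive pairs are defined per time step, the objective retains fine-grained temporal structure in the shared latent space, which is what the abstract credits for robustness to time-shift, misalignment, and dilation.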
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21630