Unsupervised Modality Adaptation in Human Action Recognition via Cross-modal Representation Learning

Published: 10 Oct 2024, Last Modified: 07 Nov 2024
Venue: UniReps
License: CC BY 4.0
Track: Extended Abstract Track
Keywords: cross-modal transfer, multimodal learning, human action recognition
TL;DR: Transfer knowledge across modalities through a unified latent space for human action recognition
Abstract: Despite living in a multi-sensory world, most AI models are limited to textual and visual interpretations of human motion and behavior. To unlock the potential of diverse sensors, we investigate a method for transferring knowledge between modalities using the structure of a unified multimodal representation space for human action recognition (HAR). We introduce an understudied cross-modal transfer setting termed Unsupervised Modality Adaptation (UMA), where the modality used in testing is not used in supervised training. We develop three methods to perform UMA: Student-Teacher (ST), Contrastive Alignment (CA), and Cross-modal Transfer Through Time (C3T). Extensive experiments on various camera+IMU datasets demonstrate that ST is effective on simple tasks, CA is the most modular and balanced method, and C3T is the most robust to temporal noise. In particular, our C3T method introduces novel mechanics for aligning a signal across time-varying latent vectors, and we show that it exhibits unique robustness to time-related noise, suggesting its potential for developing generalizable models for time-series sensor data.
Submission Number: 21
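
To make the cross-modal transfer setting concrete, below is a minimal sketch (not the authors' code) of the Contrastive Alignment (CA) idea described in the abstract: camera and IMU encoders are pulled into a shared latent space with a symmetric InfoNCE-style loss, so that at test time the unseen (IMU) modality can reuse the action classifier trained only on the seen (camera) modality. The encoder architectures, dimensions, loss, and all names here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of Contrastive Alignment (CA) for Unsupervised Modality
# Adaptation (UMA): supervised labels only touch the camera branch; the IMU
# branch is aligned to the camera latent space via a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a flattened modality input to a unit-norm latent vector."""
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def contrastive_alignment_loss(z_cam, z_imu, temperature: float = 0.07):
    """Symmetric InfoNCE: matched camera/IMU clips are positives, others negatives."""
    logits = z_cam @ z_imu.t() / temperature
    targets = torch.arange(z_cam.size(0), device=z_cam.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy training step on paired camera/IMU features for the same action clips.
cam_enc, imu_enc = Encoder(in_dim=512), Encoder(in_dim=64)
classifier = nn.Linear(128, 10)          # trained only on camera latents
x_cam, x_imu = torch.randn(32, 512), torch.randn(32, 64)
y = torch.randint(0, 10, (32,))

z_cam, z_imu = cam_enc(x_cam), imu_enc(x_imu)
loss = contrastive_alignment_loss(z_cam, z_imu) + F.cross_entropy(classifier(z_cam), y)
loss.backward()

# UMA-style inference: IMU latents are fed to the camera-trained classifier.
with torch.no_grad():
    preds = classifier(imu_enc(x_imu)).argmax(dim=-1)
```

In this sketch, the only supervision is on the camera branch; the IMU encoder inherits class structure purely through alignment in the shared latent space, which is the core of the UMA setting.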