Keywords: time series, health, foundation models, multi-modal, domain transfer
TL;DR: Speech foundation models achieve state-of-the-art performance on motion, ECG, and PPG data
Abstract: Both speech and wearable sensor time series encode information in the time and frequency domains, such as spectral power and waveform shapelets. We show that speech foundation models learn domain-independent representations and achieve state-of-the-art performance on time series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those trained on features from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find the convolutional feature encoders of speech models to be particularly relevant for wearable sensor tasks. Using simple probing methods, the approach proposed here improves performance and robustness on data-scarce time series tasks. This work is a step towards generalized time series models for speech and sensor data, a topic for further exploration.
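Below is a minimal sketch of the probing setup the abstract describes: sensor windows are fed through a frozen speech foundation model and a linear probe is trained on the pooled features. It assumes a wav2vec 2.0 checkpoint from Hugging Face and scikit-learn for the probe; the names `sensor_windows`, `labels`, and the mean-pooling choice are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

# Frozen speech foundation model used as a feature extractor.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def extract_features(signal_1d: np.ndarray) -> np.ndarray:
    """Run one single-channel sensor window through the frozen model
    and mean-pool the hidden states into a single feature vector."""
    x = torch.from_numpy(signal_1d).float().unsqueeze(0)  # (1, samples)
    with torch.no_grad():
        hidden = model(x).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Hypothetical data: sensor windows (e.g. PPG segments) resampled to
# 16 kHz, the rate wav2vec 2.0 expects, with integer class labels.
sensor_windows = [np.random.randn(32000) for _ in range(8)]  # placeholder
labels = np.random.randint(0, 2, size=8)                     # placeholder

# Simple probe: logistic regression on the frozen features.
features = np.stack([extract_features(w) for w in sensor_windows])
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.score(features, labels))
```

Because only the probe is trained, this setup is cheap and suits the data-scarce regime the abstract targets; swapping in features from the model's convolutional encoder alone is a natural variant given the abstract's observation about those layers.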
Submission Number: 25