Keywords: Wearables, Foundation Models, Convolutional Neural Networks, Efficiency, Inductive Bias
TL;DR: A lightweight U-Net–based masked autoencoder foundation model for wearable PPG signals that achieves competitive clinical classification accuracy while being two to three orders of magnitude more efficient than transformer baselines
Abstract: We propose a lightweight foundation model for wearable signals that leverages convolutional inductive biases within a masked autoencoder and U-Net CNN backbone. By explicitly encoding temporal locality and multi-scale structure, our approach aligns more naturally with the nonstationary dynamics of physiological waveforms than attention-based transformers. Pretrained on 80k hours of photoplethysmogram (PPG) data, the model matches or surpasses larger state-of-the-art baselines across ten clinical classification tasks. At the same time, it achieves reductions of two to three orders of magnitude in parameters (0.31M vs. 110M), memory footprint (3.6 MB vs. 441.3 MB), and compute, while delivering substantial speedups ($\sim$4× on CPU, $\sim$20× on GPU) and offering resolution flexibility. Together, these results establish compact convolutional self-supervised models as scientifically aligned and practically scalable foundations for real-time, on-device healthcare monitoring.
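The abstract gives no implementation details, so the following PyTorch sketch is only a rough illustration of what a lightweight 1D U-Net masked autoencoder for PPG might look like; the layer widths, kernel sizes, mask ratio, span length, and sampling rate are all assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): a lightweight 1D U-Net masked
# autoencoder for PPG-like signals. All hyperparameters below are assumed.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 1D convolutions with BatchNorm and ReLU (temporal locality bias)."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv1d(out_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
    )


class UNet1DMAE(nn.Module):
    """U-Net encoder/decoder that reconstructs masked spans of a 1D signal."""

    def __init__(self, channels=(16, 32, 64), mask_ratio=0.5, span=64):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.span = span  # length of each masked segment, in samples
        # Encoder: conv blocks with downsampling (multi-scale structure)
        self.enc1 = conv_block(1, channels[0])
        self.enc2 = conv_block(channels[0], channels[1])
        self.enc3 = conv_block(channels[1], channels[2])
        self.pool = nn.MaxPool1d(2)
        # Decoder: upsample and fuse skip connections
        self.up2 = nn.ConvTranspose1d(channels[2], channels[1], 2, stride=2)
        self.dec2 = conv_block(channels[1] * 2, channels[1])
        self.up1 = nn.ConvTranspose1d(channels[1], channels[0], 2, stride=2)
        self.dec1 = conv_block(channels[0] * 2, channels[0])
        self.head = nn.Conv1d(channels[0], 1, kernel_size=1)

    def random_span_mask(self, x):
        """Zero out random contiguous spans covering ~mask_ratio of the signal."""
        b, _, t = x.shape
        mask = torch.ones(b, 1, t, device=x.device)
        n_spans = max(1, int(self.mask_ratio * t / self.span))
        for i in range(b):
            starts = torch.randint(0, t - self.span, (n_spans,))
            for s in starts:
                mask[i, :, s:s + self.span] = 0.0
        return mask

    def forward(self, x):
        mask = self.random_span_mask(x)
        e1 = self.enc1(x * mask)          # full resolution
        e2 = self.enc2(self.pool(e1))     # 1/2 resolution
        e3 = self.enc3(self.pool(e2))     # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        recon = self.head(d1)
        # Reconstruction loss computed only over masked positions
        loss = ((recon - x) ** 2 * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
        return recon, loss


if __name__ == "__main__":
    # Example: a 30 s PPG window at an assumed 64 Hz sampling rate
    model = UNet1DMAE()
    ppg = torch.randn(8, 1, 1920)
    _, loss = model(ppg)
    loss.backward()
    print(f"params: {sum(p.numel() for p in model.parameters()):,}, loss: {loss.item():.4f}")
```

In this sketch the convolutional encoder/decoder carries the inductive biases the abstract emphasizes: local kernels encode temporal locality, and the pooling/upsampling path with skip connections provides multi-scale structure, while the span masking plays the role of the masked-autoencoder pretraining objective.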
Submission Number: 45