Abstract: Self-supervised pretraining has recently driven significant advances in artificial intelligence, particularly in natural language processing and computer vision, where pretrained models serve as foundational building blocks for downstream tasks and applications. The framework has also shown great potential as a foundation for applications on a wider range of data types. In this paper, we conduct an empirical study to provide further evidence toward a basic question in applying pretrained models to sensor data: how do the two most widely used training objectives, masked language modeling and contrastive learning, behave in the inertial sensor domain? We study human activity data, motivated by its wide range of applications. We use encoder architectures and leverage linear probing to assess the quality of the learned encoders on different tasks. Our experiments show that masked language modeling consistently outperforms contrastive learning. We provide detailed analysis and visualization demonstrating the effectiveness of masked language modeling on three representative tasks: human activity recognition, inertial odometry, and human inertial posing. While we focus on these specific tasks, we hope this study inspires further research into the effectiveness of pretrained architectures in the sensor domain.
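The linear-probing protocol mentioned above can be sketched as follows. This is a minimal illustrative example, not the paper's actual pipeline: the "pretrained encoder" is stood in for by a fixed random projection, and the sensor data and labels are synthetic. The point is only to show the protocol itself: the encoder is frozen, and a single linear layer is trained on top of its features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "pretrained encoder" (assumption for illustration):
# a fixed random projection from a 6-dim sensor window to 16-dim features.
W_enc = rng.normal(size=(6, 16))

def encode(x):
    # Frozen encoder: its parameters are never updated during probing.
    return np.tanh(x @ W_enc)

# Synthetic stand-in for labeled activity data: two classes
# separated along the first input dimension.
n = 200
X = rng.normal(size=(n, 6))
y = (X[:, 0] > 0).astype(int)

Z = encode(X)  # features from the frozen encoder

# Linear probe: a single logistic-regression layer trained by
# gradient descent; only w and b are updated, never the encoder.
w = np.zeros(Z.shape[1])
b = 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    w -= lr * (Z.T @ (p - y)) / n
    b -= lr * np.mean(p - y)

acc = np.mean(((Z @ w + b) > 0).astype(int) == y)
print(f"linear-probe accuracy: {acc:.2f}")
```

Because the encoder stays frozen, probe accuracy reflects how linearly separable the learned representation already is, which is why it is a common proxy for representation quality when comparing pretraining objectives.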