Abstract: Transformer architectures have become popular in both vision and natural language processing tasks, setting new performance benchmarks thanks to their ability to model long-range dependencies, their efficient parallel processing, and their increased model capacity. While transformers offer powerful capabilities, their demanding computational requirements clash with the real-time and energy-efficiency needs of edge-oriented human activity recognition (HAR). Compressing the transformer is therefore necessary to reduce its memory consumption and accelerate inference. In this paper, we investigated the binarization of a transformer, DeepViT, for efficient human activity recognition. To feed sensor signals into DeepViT, we first converted them into spectrograms using the wavelet transform. We then applied three methods to binarize DeepViT and evaluated them on three public benchmark datasets for sensor-based human activity recognition. Compared to the full-precision DeepViT, the fully binarized model (Bi-DeepViT) reduced the model size by about 96.7% and the BOPs (bit operations) by about 99%, with only a small loss in accuracy. Furthermore, we explored the effects of binarizing various components of DeepViT, as well as latent binarization, to understand their impact on the model. We also validated the performance of Bi-DeepViTs on two wireless sensing datasets. The results show that certain partial binarizations can improve the performance of DeepViT. To the best of our knowledge, our work is the first to apply a binarized transformer to HAR.
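To make the binarization idea concrete, the sketch below shows a PyTorch linear layer whose weights are binarized with a sign function, a per-output-channel scaling factor, and a straight-through estimator for gradients. This is a minimal, generic illustration of 1-bit weight binarization, not the exact scheme used for Bi-DeepViT; the names `BinarizeSTE` and `BinaryLinear` and the choice of scaling are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass gradients through unchanged where |w| <= 1, block them elsewhere.
        return grad_output * (w.abs() <= 1).float()


class BinaryLinear(nn.Linear):
    """Linear layer with 1-bit weights scaled per output channel."""

    def forward(self, x):
        # alpha minimizes ||W - alpha * sign(W)|| per output channel.
        alpha = self.weight.abs().mean(dim=1, keepdim=True)
        w_bin = BinarizeSTE.apply(self.weight) * alpha
        return F.linear(x, w_bin, self.bias)
```

In a sketch like this, replacing the `nn.Linear` layers of a transformer block with `BinaryLinear` shrinks weight storage roughly 32x (1 bit instead of 32), which is the source of the large model-size and BOPs reductions the abstract reports.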