Ske-Fi: Estimating Hand Poses via RF Vision Under Low Contrast and Occlusion

Jia Xu, Biao Yang, Zhe Chen, Pin Lv, Shujie Zhang, Jun Luo

Published: 01 Jan 2024, Last Modified: 10 Oct 2024. IEEE Internet Things J. 2024. License: CC BY-SA 4.0

Abstract: Hand pose estimation (HPE), which aims to identify and recover the keypoints of a hand, is essential to many potential applications. Conventional computer vision (CV) methods extract visible features from images or videos captured by cameras. However, they are heavily affected by low image contrast, fail under occlusion, and inevitably raise privacy concerns. Fortunately, CV leveraging widely available radio frequency (RF) signals (also known as RF vision) can fully address these problems with much lower computational complexity. In this article, we propose Ske-Fi as a realization of HPE enabled by RF vision: it uses the emerging impulse radio ultra-wideband (IR-UWB) technology available on smart devices (e.g., Apple AirTag) to sense the RF signals reflected by a hand and extract hand skeleton features for pose estimation. Although Ske-Fi is immune to low contrast and occlusion, the substantially reduced resolution of IR-UWB signals makes the resulting RF images incomprehensible to the human eye, which rules out offline labeling. To address this challenge, Ske-Fi employs a deep complex-valued neural network, Ske-Net, trained via a cross-modal supervision framework: a synchronized camera, assisted by a state-of-the-art vision network, acts as a teacher that trains Ske-Net as a student to perform HPE independently afterward. Furthermore, for occlusion cases, Ske-Fi adopts an adversarial learning scheme to distill HPE features regardless of the diverse occlusions encountered. Our extensive evaluations demonstrate that Ske-Fi outperforms conventional CV solutions: it achieves comparable HPE accuracy under normal circumstances and maintains this accuracy under adverse scenarios.
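The cross-modal supervision idea in the abstract (a camera-side teacher labels synchronized RF frames so that an RF-side student can later work alone) can be illustrated with a toy sketch. This is not Ske-Net or the authors' training code: the student here is a single complex-valued linear layer, the teacher is a stand-in that simply returns keypoints already present in the simulated camera frame, the data are synthetic, and every name (`teacher_keypoints`, `student_forward`, `train_step`, `make_pair`, `run_demo`) is hypothetical. The adversarial occlusion scheme is not covered.

```python
import random


def teacher_keypoints(camera_frame):
    # Hypothetical stand-in for a pretrained vision network: in this toy
    # setup the simulated "camera frame" already holds keypoint coordinates.
    return list(camera_frame)


def student_forward(weights, rf_frame):
    # Toy complex-valued linear student (an assumption, not Ske-Net):
    # each real output is Re(sum_i conj(w_i) * x_i) over the complex RF samples.
    return [sum((w.conjugate() * x).real for w, x in zip(row, rf_frame))
            for row in weights]


def mse(pred, target):
    # Mean squared error between student output and teacher pseudo-labels.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def train_step(weights, rf_frame, target, lr):
    # One gradient step on the MSE. Treating Re(w) and Im(w) as independent
    # real parameters, the gradient for weight w_ki is (2/K) * err_k * x_i.
    pred = student_forward(weights, rf_frame)
    scale = 2.0 * lr / len(target)
    for k, row in enumerate(weights):
        err = pred[k] - target[k]
        for i, x in enumerate(rf_frame):
            row[i] -= scale * err * x
    return mse(pred, target)


def make_pair(rng, hidden_map, n_in):
    # Synthetic "synchronized" sample: a random complex RF frame and the
    # keypoints a camera would see, generated by a hidden linear map.
    rf = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_in)]
    camera = student_forward(hidden_map, rf)
    return rf, camera


def run_demo(n_in=4, n_out=2, steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    hidden_map = [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
                   for _ in range(n_in)] for _ in range(n_out)]
    student = [[0j] * n_in for _ in range(n_out)]
    held_out = [make_pair(rng, hidden_map, n_in) for _ in range(50)]

    def evaluate():
        # Average error of the student against teacher labels on held-out pairs.
        return sum(mse(student_forward(student, rf), teacher_keypoints(cam))
                   for rf, cam in held_out) / len(held_out)

    before = evaluate()
    for _ in range(steps):
        rf, cam = make_pair(rng, hidden_map, n_in)
        # The teacher sees only the camera frame; the student sees only RF.
        train_step(student, rf, teacher_keypoints(cam), lr)
    after = evaluate()
    return before, after
```

Calling `run_demo()` returns the held-out error before and after training. Once trained, `student_forward` needs only the RF frame, which mirrors the abstract's point that the student performs HPE independently of the camera afterward.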