Abstract: In this paper, we propose a lightweight model for hand gesture recognition using an RGB camera. The proposed model recognizes first-person hand gestures from a single camera and achieves near-real-time performance on both high-end and low-end computing devices. The framework combines multi-task multi-modal learning with self-distillation to address the challenges of hand gesture recognition. We integrate an additional modality (depth) and a future-prediction mechanism to enhance the model's ability to learn spatio-temporal information. Furthermore, we employ self-distillation to compress the model, balancing accuracy against computational efficiency. We compare the proposed model with state-of-the-art methods; it outperforms the SOTA by 0.88% and 3.52% on the EgoGesture and NVGesture datasets, respectively. In terms of computational efficiency, our model takes only 161 ms on average to recognize a gesture on a device with a low-end GPU (NVIDIA Jetson TX2), which is acceptable for interaction in XR applications.
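For readers unfamiliar with the self-distillation step mentioned above, the sketch below illustrates the general idea in PyTorch: a compact (student) branch is trained with hard labels plus temperature-softened targets from the full (teacher) branch of the same network. This is a minimal illustration of the generic technique, not the paper's implementation; the loss weighting `alpha`, temperature `T`, and class count are illustrative assumptions.

```python
# Minimal self-distillation loss sketch (illustrative; not the paper's code).
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           alpha=0.5, T=4.0):
    """Combine hard-label cross-entropy with soft-target KL divergence.

    alpha and T are assumed hyperparameters, not values from the paper.
    """
    # Hard-label supervision for the compressed (student) branch.
    ce = F.cross_entropy(student_logits, labels)
    # Soft targets from the full (teacher) branch, softened by temperature T.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude matches the CE term
    return alpha * ce + (1.0 - alpha) * kd

# Usage example: a batch of 8 clips over 25 hypothetical gesture classes.
student = torch.randn(8, 25, requires_grad=True)
teacher = torch.randn(8, 25)
labels = torch.randint(0, 25, (8,))
loss = self_distillation_loss(student, teacher.detach(), labels)
loss.backward()
```

Detaching the teacher logits keeps gradients from flowing into the teacher branch, so only the student is updated by the distillation term; this is the usual convention in distillation-based compression.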