Abstract: In this paper, we propose a lightweight model for hand gesture recognition using an RGB camera. The proposed model recognizes first-person hand gestures from a single camera and achieves near-real-time performance on both high-end and low-end computing devices. The framework combines multi-task multi-modal learning with self-distillation to address the challenges of hand gesture recognition. We integrate an additional modality (depth) and a future-prediction mechanism to enhance the model's ability to learn spatio-temporal information. Furthermore, we employ self-distillation to compress the model, balancing accuracy against computational efficiency. We compare the proposed model with state-of-the-art methods; it outperforms the SOTA by 0.88% and 3.52% on the EgoGesture and NVGesture datasets, respectively. In terms of computational efficiency, our model takes only 161 ms on average to recognize a gesture on a device with a low-end GPU (NVIDIA Jetson TX2), which is acceptable for interaction in XR applications.
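For readers unfamiliar with the self-distillation step mentioned above, the sketch below illustrates the general idea in PyTorch: a compact (student) branch is trained with hard labels plus temperature-softened targets from the full (teacher) branch of the same network. This is a minimal illustration of the generic technique, not the paper's implementation; the loss weighting `alpha`, temperature `T`, and class count are illustrative assumptions.

```python
# Minimal self-distillation loss sketch (illustrative; not the paper's code).
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           alpha=0.5, T=4.0):
    """Combine hard-label cross-entropy with soft-target KL divergence.

    alpha and T are assumed hyperparameters, not values from the paper.
    """
    # Hard-label supervision for the compressed (student) branch.
    ce = F.cross_entropy(student_logits, labels)
    # Soft targets from the full (teacher) branch, softened by temperature T.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude matches the CE term
    return alpha * ce + (1.0 - alpha) * kd

# Usage example: a batch of 8 clips over 25 hypothetical gesture classes.
student = torch.randn(8, 25, requires_grad=True)
teacher = torch.randn(8, 25)
labels = torch.randint(0, 25, (8,))
loss = self_distillation_loss(student, teacher.detach(), labels)
loss.backward()
```

Detaching the teacher logits keeps gradients from flowing into the teacher branch, so only the student is updated by the distillation term; this is the usual convention in distillation-based compression.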