Keywords: Continuous fingerspelling recognition, HAMER, knowledge distillation, sign language recognition
TL;DR: We propose a lightweight, RGB-only fingerspelling recognition model that distills 3D hand pose knowledge from HAMER during training only, achieving state-of-the-art accuracy with real-time inference.
Abstract: Recognizing continuous fingerspelling from monocular RGB video is highly challenging due to complex hand articulation, coarticulation effects, and significant inter-signer variability. Prior methods use either raw visual features, which lack structural awareness of fine-grained finger dynamics, or parallel RGB–pose streams from explicit pose estimation, which add substantial inference-time overhead. In this work, we propose a novel knowledge distillation framework that transfers rich hand articulation knowledge from HAMER, a foundation model for 3D hand mesh/pose reconstruction, into a lightweight, RGB-only fingerspelling recognizer. We extract high-level pose embeddings from HAMER's Transformer head, which encode detailed hand structure, and distill them into a ResNet34-based appearance encoder via a dedicated training objective. The learned pose-aware features are then fed into a 1D-CNN and a BiGRU for temporal modeling, and the full system is trained jointly with a connectionist temporal classification (CTC) loss and a knowledge distillation loss. Notably, our approach does not rely on the teacher model (HAMER) at inference time, thus enabling real-time performance. We evaluate our method on two American Sign Language (ASL) fingerspelling benchmarks, as well as a studio-quality Greek fingerspelling corpus. Our model achieves state-of-the-art accuracy with over 3× lower inference time than prior methods, offering an effective trade-off between accuracy and efficiency for real-time deployment.
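To make the training setup concrete, below is a minimal PyTorch sketch of the joint objective the abstract describes: a CTC loss on the recognizer's outputs plus a feature-level distillation loss pulling student features toward precomputed HAMER pose embeddings. This is an illustration under assumptions, not the authors' exact implementation: the per-frame encoder is a placeholder for ResNet34, and `pose_dim`, `lambda_kd`, MSE as the distillation loss, and the blank index are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FingerspellingStudent(nn.Module):
    """RGB-only student: per-frame encoder -> 1D-CNN -> BiGRU -> CTC head."""

    def __init__(self, num_classes: int, feat_dim: int = 512, pose_dim: int = 1024):
        super().__init__()
        # Placeholder per-frame appearance encoder standing in for ResNet34.
        self.encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(feat_dim))
        # Projection head mapping student features into the teacher's
        # (HAMER pose-embedding) space; pose_dim is an assumed size.
        self.kd_proj = nn.Linear(feat_dim, pose_dim)
        # Temporal modeling: 1D-CNN followed by a BiGRU, as in the abstract.
        self.tcn = nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2)
        self.bigru = nn.GRU(feat_dim, feat_dim // 2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor):
        # frames: (B, T, C, H, W) clip of RGB video
        B, T = frames.shape[:2]
        f = self.encoder(frames.reshape(B * T, *frames.shape[2:]))
        f = f.reshape(B, T, -1)                        # (B, T, feat_dim)
        kd_feat = self.kd_proj(f)                      # aligned to HAMER embeddings
        t = self.tcn(f.transpose(1, 2)).transpose(1, 2)
        t, _ = self.bigru(t)
        logits = self.classifier(t)                    # (B, T, num_classes)
        return logits, kd_feat


def joint_loss(logits, kd_feat, teacher_feat, targets,
               input_lens, target_lens, lambda_kd: float = 1.0):
    """CTC loss plus MSE feature distillation to frozen HAMER embeddings."""
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)  # (T, B, C) for CTC
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    # teacher_feat: HAMER pose embeddings precomputed offline, used only in training.
    kd = F.mse_loss(kd_feat, teacher_feat)
    return ctc + lambda_kd * kd
```

At inference, only `FingerspellingStudent` runs and `kd_feat` is simply ignored, which is what removes the pose-estimation overhead the abstract highlights.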
Submission Number: 11