Keywords: fingerspelling recognition, 3D-CNN, ST-GCN, language model, alignment module, language and learning
TL;DR: This work has been accepted at INTERSPEECH 2024, and the official version is available at https://www.isca-archive.org/interspeech_2024/papadimitriou24_interspeech.pdf
Abstract: Continuous fingerspelling recognition from videos is paramount for real-time sign language (SL) interpretation, enhancing accessibility. Despite deep learning progress, challenges persist, especially in signer-independent (SI) scenarios, due to signing variability. To address these, we propose a novel bimodal approach that integrates appearance and skeletal information, focusing solely on the signing hand. Our system relies on two basic modules: (a) a 3D-CNN model capturing spatial features while adapting to motion variations, and (b) a modulated spatio-temporal graph convolutional network (ST-GCN) based on 3D joint-rotation parameterization for skeletal feature modeling. Both modalities are combined with a BiGRU encoder and CTC decoding. To further enhance representation capacity, we introduce an alignment mechanism relying on two auxiliary losses. Through ensemble fusion and language model integration, our method achieves superior performance across three SI fingerspelling datasets.
Submission Number: 143