Keywords: fingerspelling recognition, 3D-CNN, ST-GCN, language model, alignment module, language and learning
TL;DR: This work has been accepted at INTERSPEECH 2024, and the official version is available at https://www.isca-archive.org/interspeech_2024/papadimitriou24_interspeech.pdf
Abstract: Continuous fingerspelling recognition from videos is paramount for real-time sign language (SL) interpretation, enhancing accessibility. Despite deep learning progress, challenges persist, especially in signer-independent (SI) scenarios, due to signing variability. To address these, we propose a novel bimodal approach that integrates appearance and skeletal information, focusing solely on the signing hand. Our system relies on two basic modules: (a) a 3D-CNN model capturing spatial features while adapting to motion variations, and (b) a modulated spatio-temporal graph convolutional network (ST-GCN) based on 3D joint-rotation parameterization for skeletal feature modeling. Both modalities are combined with a BiGRU encoder and CTC decoding. To further enhance representation capacity, we introduce an alignment mechanism relying on two auxiliary losses. Through ensemble fusion and language model integration, our method achieves superior performance across three SI fingerspelling datasets.
Submission Number: 143