Continuous Sign Language Recognition With Multi-Scale Spatial-Temporal Feature Enhancement

Published: 01 Jan 2025 · Last Modified: 06 Feb 2025 · IEEE Access 2025 · CC BY-SA 4.0
Abstract: Continuous Sign Language Recognition (CSLR) seeks to interpret the gestures used by deaf and hard-of-hearing individuals and translate them into natural language, thereby enhancing communication and interaction. A successful CSLR method relies on continuously tracking the signer's gestures and facial movements. Existing CSLR methods struggle to fully leverage fine-grained continuous-frame information and often overlook multi-scale feature integration during decoding. To address these issues, we propose a spatial-temporal feature-enhanced network, called STNet, for the CSLR task. First, to better exploit continuous-frame information, we propose a spatial resonance module based on the optimal transport algorithm, which extracts the global common spatial features of adjacent frames along the frame sequence. Second, we design a frame-wise loss to preserve and enhance the features specific to each frame. Third, to emphasize multi-scale feature fusion on the decoder side, we design a multi-temporal perception module that allows each frame to attend to a wider range of other frames and enhances information interaction across scales. Extensive experiments on three benchmark datasets, PHOENIX14, PHOENIX14-T, and CSL-Daily, demonstrate that STNet consistently outperforms state-of-the-art methods, with a notable improvement of 2.9%, showcasing its effectiveness and generalizability. Our approach provides a robust foundation for real-world applications such as sign language education and communication tools, and ablation and case studies highlight the contribution of each module, paving the way for future research in CSLR.
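
To make the spatial resonance idea concrete, below is a minimal sketch assuming a standard entropic optimal-transport (Sinkhorn) formulation between the spatial positions of two adjacent frame feature maps. The function names (`sinkhorn`, `spatial_resonance`), the cosine-distance cost, and the hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: align spatial features of two adjacent frames with
# entropic optimal transport and keep the transported "common" part.
import torch
import torch.nn.functional as F


def sinkhorn(cost: torch.Tensor, eps: float = 0.05, n_iters: int = 20) -> torch.Tensor:
    """Entropic OT plan between two uniform distributions given a cost matrix.

    cost: (B, N, N) pairwise cost between spatial positions of frames t and t+1.
    Returns a (B, N, N) plan whose rows and columns each sum to 1/N.
    """
    B, N, _ = cost.shape
    K = torch.exp(-cost / eps)  # Gibbs kernel
    u = torch.full((B, N), 1.0 / N, device=cost.device)
    v = torch.full((B, N), 1.0 / N, device=cost.device)
    for _ in range(n_iters):  # alternating marginal normalization
        u = (1.0 / N) / (K @ v.unsqueeze(-1)).squeeze(-1).clamp_min(1e-9)
        v = (1.0 / N) / (K.transpose(1, 2) @ u.unsqueeze(-1)).squeeze(-1).clamp_min(1e-9)
    return u.unsqueeze(-1) * K * v.unsqueeze(1)  # diag(u) K diag(v)


def spatial_resonance(f_t: torch.Tensor, f_next: torch.Tensor) -> torch.Tensor:
    """Extract the features of frame t that are common with frame t+1.

    f_t, f_next: (B, C, H, W) feature maps of two adjacent frames.
    """
    B, C, H, W = f_t.shape
    a = f_t.flatten(2).transpose(1, 2)   # (B, HW, C)
    b = f_next.flatten(2).transpose(1, 2)
    cost = 1.0 - F.cosine_similarity(a.unsqueeze(2), b.unsqueeze(1), dim=-1)
    plan = sinkhorn(cost)                # (B, HW, HW) soft correspondence
    common = plan @ b * (H * W)          # barycentric map of f_{t+1} onto f_t's grid
    return common.transpose(1, 2).view(B, C, H, W)


if __name__ == "__main__":
    x1, x2 = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
    print(spatial_resonance(x1, x2).shape)  # torch.Size([2, 64, 7, 7])
```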
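Likewise, a hedged sketch of what a multi-temporal perception module on the decoder side might look like: parallel depthwise temporal convolutions with different kernel sizes give each frame access to several temporal ranges, and a pointwise convolution fuses the scales. The class name, kernel sizes, and residual fusion scheme are assumptions for illustration, not the paper's design.

```python
# Hypothetical multi-scale temporal fusion over frame-level features.
import torch
import torch.nn as nn


class MultiTemporalPerception(nn.Module):
    def __init__(self, dim: int, kernel_sizes=(3, 5, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)  # depthwise temporal conv
            for k in kernel_sizes
        )
        self.fuse = nn.Conv1d(dim * len(kernel_sizes), dim, 1)  # pointwise cross-scale fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) sequence of frame-level features
        h = x.transpose(1, 2)  # (B, C, T) for temporal convolution
        h = torch.cat([branch(h) for branch in self.branches], dim=1)
        return x + self.fuse(h).transpose(1, 2)  # residual multi-scale update


if __name__ == "__main__":
    frames = torch.randn(2, 100, 512)  # batch of 100-frame sequences
    print(MultiTemporalPerception(512)(frames).shape)  # torch.Size([2, 100, 512])
```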