Hierarchical Spatial-Temporal Enhancement Network For Continuous Sign Language Recognition

Published: 2025, Last Modified: 04 Nov 2025, Venue: ICASSP 2025, License: CC BY-SA 4.0
Abstract: In continuous sign language recognition (CSLR), 2D-CNN-based extractors are often insufficiently trained for spatial capture and struggle with temporal modeling. This leads to incomplete spatial discrimination, which hinders the understanding of actions across frames. To address these limitations, we propose the Hierarchical Spatial-Temporal Enhancement network (HSTE), built on two key modules: Cross-scale Semantic Alignment (CSA) and Temporal Extension Shift (TES). CSA utilizes the multi-scale features generated within the network, enriching the feature representation through semantic alignment across scales. By integrating a novel temporal shift strategy with dilated convolutions, TES expands the receptive field and captures temporal changes between frames. These modules work independently and are hierarchically integrated into the network in a plug-and-play manner. Extensive experiments show that our method achieves state-of-the-art performance on the challenging CSLR benchmarks PHOENIX14, PHOENIX14-T, and CSL-Daily. Code will be available at https://github.com/justlis/HSTENet.
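The abstract does not specify how TES combines temporal shifting with dilated convolutions, so the following is only a generic illustration of the two ingredients it names, not the authors' module. It assumes a TSM-style channel shift along time and a depthwise dilated 1D convolution over per-frame features of shape (T, C); the function names, the `fold_div` parameter, and the NumPy setting are all hypothetical choices made for this sketch.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Generic TSM-style shift (assumption, not the paper's TES).
    x has shape (T, C): T frames, C feature channels."""
    t, c = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                    # these channels see the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]    # these channels see the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]               # remaining channels are unchanged
    return out

def dilated_temporal_conv(x, kernel, dilation=2):
    """Depthwise 1D convolution along time with zero 'same' padding.
    kernel has shape (K,) with K odd and is shared by all channels."""
    t, _ = x.shape
    k = len(kernel)
    pad = dilation * (k // 2)
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for i, w in enumerate(kernel):
        # tap i reads the frame at offset (i - k // 2) * dilation
        out += w * xp[i * dilation : i * dilation + t]
    return out

def receptive_field(kernel_size, dilation):
    """Number of frames spanned by one dilated convolution."""
    return dilation * (kernel_size - 1) + 1

# Hypothetical usage: 16 frames, 8 channels
feats = np.arange(16 * 8, dtype=float).reshape(16, 8)
shifted = temporal_shift(feats)
smoothed = dilated_temporal_conv(shifted, np.array([0.25, 0.5, 0.25]), dilation=2)
```

In this sketch, a kernel of size 3 with dilation 2 spans 5 frames instead of 3 at no extra parameter cost, which is the usual sense in which dilation expands the temporal receptive field; the shift lets neighbouring frames exchange information without any added parameters.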