Spatial Temporal Aggregation for Efficient Continuous Sign Language Recognition

Published: 01 Jan 2024 · Last Modified: 19 Feb 2025 · IEEE Trans. Emerg. Top. Comput. Intell., 2024 · CC BY-SA 4.0
Abstract: Despite recent progress in continuous sign language recognition (CSLR), most state-of-the-art methods process input sign language videos frame by frame to predict sentences. This incurs a heavy computational burden and is inefficient, or even infeasible, in real-world scenarios. Motivated by the fact that videos are inherently redundant and not all frames are essential for recognition, we propose spatial temporal aggregation (STAgg) to address this problem. Specifically, STAgg synthesizes adjacent similar frames into a unified robust representation before they are fed into the recognition module, greatly reducing computational complexity and memory demand. We first give a detailed analysis of commonly used aggregation methods such as subsampling, max pooling, and average pooling, and then naturally derive STAgg from the expected design criteria. Extensive ablation studies verify that our three diverse STAgg variants outperform the commonly used pooling and subsampling counterparts in both accuracy and efficiency. The best variant achieves accuracy comparable to state-of-the-art competitors while being 1.35× faster, with only 0.50× the computational cost, 0.70× the training time, and 0.65× the memory usage. Experiments on four large-scale datasets with multiple backbones fully verify the generalizability and effectiveness of the proposed STAgg. Another advantage of STAgg is that it enables more powerful backbones, which may further boost CSLR accuracy under similar computational/memory budgets. We also visualize the results of STAgg to support an intuitive and insightful analysis of its effects.
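To make the contrast the abstract draws more concrete, the following is a minimal PyTorch sketch of the baseline aggregation methods it names (subsampling, max pooling, average pooling), plus a simple similarity-gated merge of adjacent frames in the general spirit of "synthesizing adjacent similar frames". All function names, the window/threshold parameters, and the cosine-similarity heuristic are illustrative assumptions and do not reproduce the paper's actual STAgg design.

```python
# Hypothetical sketch of temporal aggregation baselines over per-frame
# features; NOT the paper's STAgg implementation.
import torch
import torch.nn.functional as F


def subsample(frames: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Keep every `stride`-th frame. frames: (T, C) feature sequence."""
    return frames[::stride]


def max_pool(frames: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Max-pool non-overlapping windows of adjacent frames along time."""
    T, C = frames.shape
    T = T - T % window  # drop a ragged tail for simplicity
    return frames[:T].view(-1, window, C).max(dim=1).values


def avg_pool(frames: torch.Tensor, window: int = 2) -> torch.Tensor:
    """Average non-overlapping windows of adjacent frames along time."""
    T, C = frames.shape
    T = T - T % window
    return frames[:T].view(-1, window, C).mean(dim=1)


def similarity_merge(frames: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Merge runs of adjacent frames whose cosine similarity exceeds
    `threshold` into their mean; dissimilar frames start a new run.
    An assumed heuristic illustrating 'aggregating adjacent similar
    frames', not the method derived in the paper."""
    merged, run = [], [frames[0]]
    for prev, cur in zip(frames[:-1], frames[1:]):
        if F.cosine_similarity(prev, cur, dim=0) > threshold:
            run.append(cur)
        else:
            merged.append(torch.stack(run).mean(dim=0))
            run = [cur]
    merged.append(torch.stack(run).mean(dim=0))
    return torch.stack(merged)


if __name__ == "__main__":
    feats = torch.randn(16, 512)          # 16 frame features, 512-dim
    print(subsample(feats).shape)         # torch.Size([8, 512])
    print(max_pool(feats).shape)          # torch.Size([8, 512])
    print(avg_pool(feats).shape)          # torch.Size([8, 512])
    print(similarity_merge(feats).shape)  # data-dependent length <= 16
```

Note the design trade-off the abstract's analysis points at: fixed-stride subsampling and pooling shorten every sequence by the same factor regardless of content, whereas a content-aware merge compresses static segments more aggressively while preserving fast transitions, at the cost of a variable output length.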