A Lightweight Multi-Variable Spatio-Temporal Convolutional Framework for Dynamic Gesture Recognition

Published: 05 Nov 2025, Last Modified: 30 Jan 2026
Venue: 3DV 2026 Poster
License: CC BY 4.0
Abstract: Transformer-based hybrid architectures have achieved remarkable performance in dynamic hand gesture recognition. However, their high computational overhead and model size limit deployment in resource-limited environments. Motivated by this limitation, we propose the Decoupled Spatio-Temporal Convolutional Network (DSTCNet), a lightweight, pure convolutional framework trained end-to-end delivering high accuracy with a fraction of the complexity. DSTCNet integrates two components: (1) an efficient pseudo-3D spatial backbone, the Pseudo-3D Gated Attentional Fusion Network (P3D-GAFNet), enhancing spatial feature extraction via positional prior injection, and (2) a temporal modeling network, the Multi-Variable Decomposition Temporal Convolutional Network (MVD-TCN), leveraging multi-variable feature decomposition with modern convolutional blocks to capture long-range temporal dependencies without the cost of self-attention. With only 9.6M parameters, DSTCNet matches or surpasses the accuracy of substantially larger models on several challenging benchmarks, while offering high computational efficiency, lower memory usage, and reduced energy consumption—making it a practical solution for deployment on edge devices. Our results demonstrate that modernized pure convolutional architectures can serve as a robust and efficient alternative to hybrid designs, offering valuable insights for the broader field of video understanding.
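The efficiency argument behind decoupling spatial and temporal convolution can be illustrated with a simple parameter count: a dense 3D kernel costs O(k^3) weights per channel pair, while a pseudo-3D factorization into a 2D spatial convolution followed by a 1D temporal convolution costs O(k^2 + k). The sketch below uses illustrative channel and kernel sizes, not the actual DSTCNet configuration, which the abstract does not specify.

```python
# Parameter-count comparison (biases omitted): a dense 3D convolution
# versus a pseudo-3D factorization into a spatial 2D convolution
# followed by a temporal 1D convolution. Sizes are hypothetical.

def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a dense 3D convolution."""
    return c_in * c_out * k_t * k_h * k_w

def pseudo3d_params(c_in, c_out, k_t, k_h, k_w):
    """Spatial (1 x k_h x k_w) conv followed by temporal (k_t x 1 x 1) conv."""
    spatial = c_in * c_out * k_h * k_w
    temporal = c_out * c_out * k_t
    return spatial + temporal

full = conv3d_params(64, 64, 3, 3, 3)          # 110592 weights
decoupled = pseudo3d_params(64, 64, 3, 3, 3)   # 36864 + 12288 = 49152 weights
print(full, decoupled, round(full / decoupled, 2))  # → 110592 49152 2.25
```

With a 3x3x3 kernel and 64 channels in and out, the factorized form uses 2.25x fewer weights; the gap widens with larger temporal kernels, which is why decoupled designs can afford long temporal receptive fields cheaply.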
Submission Number: 396