Skeleton-Based Real-Time Hand Gesture Recognition Using Data Fusion and Ensemble Multi-Stream CNN Architecture

Maki K. Habib, Oluwaleke Yusuf, Mohamed Moustafa

Published: 26 Oct 2025, Last Modified: 23 Jan 2026TechnologiesEveryoneRevisionsCC BY-SA 4.0

Abstract: Hand Gesture Recognition (HGR) is a vital technology that enables intuitive human–computer interaction in various domains, including augmented reality, smart environments, and assistive systems. Achieving both high accuracy and real-time performance remains challenging due to the complexity of hand dynamics, individual morphological variations, and computational limitations. This paper presents a lightweight and efficient skeleton-based HGR framework that addresses these challenges through an optimized multi-stream Convolutional Neural Network (CNN) architecture and a trainable ensemble tuner. Dynamic 3D gestures are transformed into structured, noise-minimized 2D spatiotemporal representations via enhanced data-level fusion, supporting robust classification across diverse spatial perspectives. The ensemble tuner strengthens semantic relationships between streams and improves recognition accuracy. Unlike existing solutions that rely on high-end hardware, the proposed framework achieves real-time inference on consumer-grade devices without compromising accuracy. Experimental validation across five benchmark datasets (SHREC2017, DHG1428, FPHA, LMDHG, and CNR) confirms consistent or superior performance with reduced computational overhead. Additional validation on the SBU Kinect Interaction Dataset highlights generalization potential for broader Human Action Recognition (HAR) tasks. This advancement bridges the gap between efficiency and accuracy, supporting scalable deployment in AR/VR, mobile computing, interactive gaming, and resource-constrained environments.

External IDs:doi:10.3390/technologies13110484