Doodle to Detect: A Goofy but Powerful Approach to Skeleton-based Hand Gesture Recognition

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · License: CC BY 4.0
Keywords: Hand Gesture Recognition, Skeleton-based Action Recognition, Online Recognition, Modality Transform, Vision Transformer
TL;DR: We propose SKETCH, a novel method that extracts high-level features by transforming raw four-dimensional online skeleton coordinate sequences into graph images.
Abstract: Skeleton-based hand gesture recognition plays a crucial role in enabling intuitive human–computer interaction. Traditional methods have primarily relied on hand-crafted features, such as distances between joints or positional changes across frames, to alleviate issues arising from viewpoint variation or differences in body proportions. However, these hand-crafted features often fail to capture the full spatio-temporal information in raw skeleton data, exhibit poor interpretability, and depend heavily on dataset-specific preprocessing, limiting generalization. In addition, normalization strategies that rely on training-data statistics can introduce domain gaps between training and testing environments, further hindering robustness in diverse real-world settings. To overcome these challenges, we dispense with hand-crafted features and propose Skeleton Kinematics Extraction Through Coordinated grapH (SKETCH), a novel framework that operates directly on raw four-dimensional (time, x, y, and z) skeleton sequences and transforms them into intuitive visual graph representations. The framework incorporates a novel learnable Dynamic Range Embedding (DRE) that preserves the axis-wise motion magnitudes otherwise lost during normalization, complementing the visual graph representations and enabling richer, more discriminative feature learning. The resulting graph image captures the raw data's inherent information and provides interpretable visual attention cues. Furthermore, SKETCH applies independent min–max normalization to fixed-length temporal windows in real time, mitigating degradation from absolute-coordinate fluctuations caused by varying sensor viewpoints or individual differences in body proportions. Through these designs, our approach is inherently topology-agnostic, avoiding fragile dependencies on dataset- or sensor-specific skeleton definitions. By leveraging pre-trained vision backbones, SKETCH converges efficiently and achieves superior recognition accuracy. Experimental results on the SHREC'19 and SHREC'22 benchmarks show that it outperforms state-of-the-art methods in both robustness and generalization, establishing a new paradigm for skeleton-based hand gesture recognition. The code is available at https://github.com/capableofanything/SKETCH.
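To make the abstract's "independent min–max normalization on fixed-length temporal windows" concrete, here is a minimal sketch, not the authors' released code (see the GitHub link above for that). The window shape (T frames, J joints, 3 spatial axes), the 21-joint hand example, and the function names `normalize_window` and `sliding_windows` are all illustrative assumptions; the key idea, per the abstract, is that each window is rescaled using only its own per-axis extrema, so no training-set statistics are involved.

```python
# Hypothetical sketch of windowed, per-axis min-max normalization for an
# online skeleton stream. Each fixed-length window of shape (T, J, 3) is
# rescaled to [0, 1] independently along x, y, and z using only that
# window's own extrema, removing absolute-coordinate and scale effects.
import numpy as np

def normalize_window(window: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max normalize one window, independently per spatial axis.

    window: array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    Returns an array of the same shape with each axis scaled to [0, 1].
    """
    # Reduce over frames and joints; keep the 3 spatial axes separate.
    lo = window.min(axis=(0, 1), keepdims=True)  # shape (1, 1, 3)
    hi = window.max(axis=(0, 1), keepdims=True)  # shape (1, 1, 3)
    return (window - lo) / (hi - lo + eps)

def sliding_windows(stream: np.ndarray, length: int, stride: int):
    """Slide a fixed-length window over the stream, normalizing each one
    with its own statistics (hence no train/test normalization gap)."""
    for start in range(0, stream.shape[0] - length + 1, stride):
        yield normalize_window(stream[start:start + length])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = rng.normal(size=(120, 21, 3))  # e.g. 120 frames, 21 hand joints
    for w in sliding_windows(stream, length=30, stride=10):
        assert 0.0 <= w.min() and w.max() <= 1.0
```

In the paper's pipeline, each normalized window would then be rendered as a graph image and passed to a pre-trained vision backbone; that rendering step is specific to SKETCH and is not reproduced here.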
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 27022