Continuous Hand Gesture Spotting through Deep Sequential Encoding and Probabilistic Time-Series Modeling

Hyeonkyu Lee; Young-Eun Lee; JaeHeung Park

15 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0

Keywords: Continuous Hand Gesture Spotting, Gesture recognition, Hidden Markov Model, LSTM Autoencoder, MediaPipe Hands, Deep sequential encoding, Probabilistic Time-Series Modeling, Threshold-based Filtering

TL;DR: A hybrid framework with MediaPipe Hands, LSTM Autoencoder, and Gaussian HMMs achieves 96.6% accuracy, 97.9% F1, and 6.6% WER in continuous gesture spotting, enabling accurate, data-efficient VR/AR interaction.

Abstract: Continuous hand gesture spotting in real time is a challenging problem because ambiguous gesture boundaries and abundant non-gesture motions often confound recognition systems. Unlike isolated recognition, spotting requires detecting both the onset and offset of gestures while rejecting irrelevant transitions, making robustness crucial for practical human–computer interaction. We present a hybrid framework that integrates MediaPipe Hands for extracting 3D landmarks, an LSTM Autoencoder for compact spatiotemporal encoding, and Gaussian Hidden Markov Models (HMMs) for probabilistic sequence modeling. To further suppress spurious detections during transitions, we introduce an ergodic threshold mechanism that adaptively filters low-likelihood segments. On a vocabulary of 10 command gestures, the system achieves 96.56% recognition accuracy, 97.89% segmental F1, and 6.55% word error rate (WER) in continuous input streams, while remaining lightweight enough to run on a CPU-only device. These results show that combining deep representation learning with probabilistic dynamics yields reliable boundary detection without heavy computational overhead. Beyond empirical gains, the framework is data-efficient and readily extensible to new vocabularies, enabling rapid adaptation with limited training data. Overall, these findings demonstrate the practical feasibility of robust gesture spotting, bridging the gap between controlled research settings and real-world applications in VR/AR environments and customizable user interfaces.
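The likelihood-based rejection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the threshold mechanism resembles a classic ergodic threshold model, built here by pooling the states of all gesture HMMs and connecting them with uniform transitions. The 2-D features stand in for the LSTM Autoencoder codes, and the names `hmm_log_likelihood`, `spot`, `gestures`, and `threshold` are hypothetical.

```python
import numpy as np

def log_gaussian(x, means, variances):
    """Log density of each frame in x (T, D) under each state's diagonal Gaussian (K, D)."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, D)
    return -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)

def hmm_log_likelihood(x, start, trans, means, variances):
    """Forward algorithm in log space; returns log p(x | model)."""
    log_b = log_gaussian(x, means, variances)                     # (T, K)
    with np.errstate(divide="ignore"):                            # log(0) -> -inf is intended
        log_start, log_trans = np.log(start), np.log(trans)
    alpha = log_start + log_b[0]
    for t in range(1, len(x)):
        # Sum over previous states i for every next state j, then emit frame t.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

def spot(window, gesture_models, threshold_model):
    """Return the best gesture label, or None if the threshold model explains the window better."""
    scores = {name: hmm_log_likelihood(window, *m) for name, m in gesture_models.items()}
    best = max(scores, key=scores.get)
    floor = hmm_log_likelihood(window, *threshold_model)
    return best if scores[best] > floor else None

def left_to_right(means):
    """Toy two-state left-to-right Gaussian HMM (parameters are illustrative assumptions)."""
    means = np.asarray(means, dtype=float)
    start = np.array([1.0, 0.0])
    trans = np.array([[0.8, 0.2], [0.0, 1.0]])
    return start, trans, means, np.full_like(means, 0.05)

gestures = {
    "swipe": left_to_right([[0, 0], [1, 1]]),
    "circle": left_to_right([[0, 1], [1, 0]]),
}

# Assumed threshold model: every gesture state pooled, fully connected, uniform probabilities.
all_means = np.vstack([m[2] for m in gestures.values()])
all_vars = np.vstack([m[3] for m in gestures.values()])
k = len(all_means)
threshold = (np.full(k, 1.0 / k), np.full((k, k), 1.0 / k), all_means, all_vars)

swipe = np.array([[0, 0]] * 5 + [[1, 1]] * 5, dtype=float)        # follows the swipe model
jitter = np.array([[0, 0], [1, 0], [0, 1], [1, 1]] * 2, dtype=float)  # unordered transition motion

print(spot(swipe, gestures, threshold))
print(spot(jitter, gestures, threshold))
```

In this toy setup the ordered swipe sequence scores higher under its own left-to-right model than under the pooled threshold model, so it is accepted. The unordered jitter sequence is better explained by the fully connected threshold model and is rejected as non-gesture motion.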

Supplementary Material: zip

Primary Area: learning on time series and dynamical systems

Submission Number: 5297
