EMBridge: Enhancing Gesture Generalization from EMG Signals Through Cross-modal Representation Learning
Keywords: EMG, Zero-shot Gesture Classification, Cross-modal, Representation Learning
TL;DR: We propose EMBridge, a cross-modal representation learning framework that aligns EMG representations with more structured pose representations. EMBridge enables zero-shot classification of unseen gestures and achieves superior generalization.
Abstract: Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Alternatively, leveraging low-power, cost-effective bio-signals, e.g., surface electromyography (sEMG), allows for continuous gesture prediction on wearable devices. In this work, we aim to enhance EMG representation quality by aligning it with embeddings obtained from structured, high-quality modalities that provide richer semantic guidance, ultimately enabling zero-shot gesture generalization. Specifically, we propose EMBridge, a cross-modal representation learning framework that bridges the modality gap between EMG and pose. EMBridge learns high-quality EMG representations by introducing a Querying Transformer (Q-Former), a masked pose reconstruction loss, and a community-aware soft contrastive learning objective that aligns the relative geometry of the embedding spaces. We evaluate EMBridge on both in-distribution and unseen gesture classification tasks and demonstrate consistent performance gains over all baselines. To the best of our knowledge, EMBridge is the first cross-modal representation learning framework to achieve zero-shot gesture classification from wearable EMG signals, showing its potential for real-world gesture recognition on wearable devices.
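The abstract names the main components of EMBridge (a Q-Former over EMG features and a soft contrastive objective that aligns EMG with pose embeddings). The following is a minimal, illustrative sketch of how such an alignment could look; it is not the authors' implementation, and all module names, dimensions, the temperature value, and the specific soft-target construction are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code): aligning EMG and pose embeddings
# with query tokens (Q-Former-style) and a soft contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMGPoseAligner(nn.Module):
    def __init__(self, emg_dim=256, pose_dim=128, shared_dim=128, num_queries=8):
        super().__init__()
        # Learnable query tokens that attend to EMG encoder outputs.
        self.queries = nn.Parameter(torch.randn(num_queries, emg_dim))
        self.cross_attn = nn.MultiheadAttention(emg_dim, num_heads=4, batch_first=True)
        self.emg_proj = nn.Linear(emg_dim, shared_dim)
        self.pose_proj = nn.Linear(pose_dim, shared_dim)

    def forward(self, emg_feats, pose_feats):
        # emg_feats: (B, T, emg_dim) per-window EMG encoder features
        # pose_feats: (B, pose_dim) pooled pose-encoder embeddings
        B = emg_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        attended, _ = self.cross_attn(q, emg_feats, emg_feats)
        z_emg = F.normalize(self.emg_proj(attended.mean(dim=1)), dim=-1)
        z_pose = F.normalize(self.pose_proj(pose_feats), dim=-1)
        return z_emg, z_pose

def soft_contrastive_loss(z_emg, z_pose, temperature=0.07):
    """Contrastive alignment with soft targets: instead of strict one-hot
    positives, targets are derived from pose-pose similarity, so that samples
    with similar pose embeddings (a gesture 'community') are softly attracted.
    This is one plausible reading of a community-aware soft contrastive loss."""
    logits = z_emg @ z_pose.t() / temperature
    with torch.no_grad():
        targets = F.softmax(z_pose @ z_pose.t() / temperature, dim=-1)
    # Cross-entropy with probability targets (PyTorch >= 1.10).
    return F.cross_entropy(logits, targets)

# Usage sketch:
# z_emg, z_pose = aligner(emg_batch, pose_batch)
# loss = soft_contrastive_loss(z_emg, z_pose)
```

Using pose-derived soft targets (rather than one-hot positives) is one way to align the relative geometry of the two embedding spaces, since the EMG-pose similarity matrix is pushed toward the pose-pose similarity structure rather than toward strict instance matching.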
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14247