SNN-Driven Multimodal Human Action Recognition via Sparse Spatial-Temporal Data Fusion

16 Sept 2025 (modified: 22 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal, Action Recognition, Spiking Neural Networks (SNNs)
Abstract: Recent multimodal action recognition approaches that combine RGB and skeleton data have achieved strong performance, but their high computational cost and poor energy efficiency hinder deployment on edge devices. To address these limitations, we propose, to the best of our knowledge, the first spiking neural network (SNN)-based framework for multimodal human action recognition, offering an energy-efficient and scalable solution that fuses sparse spatiotemporal data from event cameras and skeletons within a unified spiking architecture. The framework leverages the sparse, asynchronous nature of event and skeleton data together with the energy-efficient properties of SNNs. It does so through a series of tailored components, including modality-specific feature extraction, a sparse semantic extractor, spiking-based cross-modal fusion via Spiking Cross Mamba, and task-relevant feature compression via a Discretized Information Bottleneck (DIB). To support reproducible evaluation, we further introduce a data construction pipeline that generates temporally aligned event-skeleton pairs from existing RGB-skeleton datasets. Extensive experiments demonstrate that our approach achieves state-of-the-art accuracy among SNNs while significantly reducing energy consumption, providing a practical and scalable solution for neuromorphic multimodal action recognition.
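The abstract mentions a data construction pipeline that derives temporally aligned event-skeleton pairs from RGB-skeleton datasets, but the exact procedure is not detailed here. Below is a minimal illustrative sketch, assuming a simple frame-difference event simulator (real event simulation, e.g. v2e-style, models the sensor far more faithfully); the function names, threshold value, and alignment rule are hypothetical and not taken from the paper.

```python
import numpy as np

def rgb_to_events(frames, threshold=0.1):
    """Simulate polarity events from consecutive RGB frames.

    Simplified frame-difference simulator: a pixel emits a positive (+1)
    or negative (-1) event when its log-intensity change between two
    consecutive frames exceeds `threshold`. Illustrative only.
    """
    gray = frames.mean(axis=-1) / 255.0      # (T, H, W) grayscale in [0, 1]
    log_i = np.log(gray + 1e-3)              # small offset avoids log(0)
    diff = np.diff(log_i, axis=0)            # (T-1, H, W) temporal differences
    events = np.zeros_like(diff, dtype=np.int8)
    events[diff > threshold] = 1             # ON events
    events[diff < -threshold] = -1           # OFF events
    return events                            # one sparse event frame per frame pair

def align_event_skeleton(events, skeletons):
    """Pair each event frame with the skeleton of the later RGB frame,
    so both modalities share the same timestamps."""
    return list(zip(events, skeletons[1:]))

# Usage sketch: frames (T, H, W, 3) uint8 and skeletons (T, J, 3) from an
# RGB-skeleton dataset yield T-1 temporally aligned event-skeleton pairs.
frames = np.random.randint(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
skeletons = np.random.randn(8, 25, 3).astype(np.float32)
pairs = align_event_skeleton(rgb_to_events(frames), skeletons)
print(len(pairs))  # 7
```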
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6789