Hybrid Spiking Vision Transformer for Object Detection with Event Cameras

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
Abstract: Event-based object detection has attracted increasing attention for its high temporal resolution, wide dynamic range, and asynchronous address-event representation. Leveraging these advantages, spiking neural networks (SNNs) have emerged as a promising approach, offering low energy consumption and rich spatiotemporal dynamics. To further improve event-based object detection, this study proposes a novel hybrid spiking vision Transformer (HsVT) model. HsVT integrates a spatial feature extraction module that captures local and global features with a temporal feature extraction module that models time dependencies and long-term patterns in event sequences. This combination enables HsVT to capture rich spatiotemporal features, improving its ability to handle complex event-based object detection tasks. To support research in this area, we developed a Fall Detection dataset as a benchmark for event-based object detection. Thanks to its event-based representation, the dataset protects facial privacy and reduces memory usage. Experimental results demonstrate that HsVT outperforms existing SNN methods and achieves competitive performance compared with ANN-based models, with fewer parameters and lower energy consumption.
Lay Summary: Conventional cameras operate at fixed frame rates, leading to motion blur during fast motion and limited dynamic range under extreme lighting. Event cameras asynchronously capture brightness changes with low latency, enabling high temporal resolution and improved performance in challenging conditions. However, efficiently processing such asynchronous and sparse data remains a significant challenge. We therefore propose a hybrid spiking vision Transformer model (HsVT) that combines the low-energy properties of brain-like spiking neural networks (SNNs) with the powerful modeling capabilities of Transformer architectures. The HsVT model is built from four blocks, each combining spatial and temporal feature extraction. Spatial features are captured using multi-axis attention mechanisms and multi-layer perceptrons, while temporal dynamics are modeled by an LSTM and a novel Spiking Temporal Feature Extraction (STFE) module we designed. We also construct an event-based fall detection dataset, which not only protects user privacy but also reduces storage cost. On the GEN1 dataset, HsVT outperforms existing SNN methods and achieves performance comparable to ANN-based models with fewer parameters and lower power consumption.
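The summary above does not spell out the spiking dynamics inside the STFE module. As a rough illustration only, under standard leaky integrate-and-fire (LIF) neuron assumptions commonly used in SNNs, the temporal pathway's core mechanism might look like the sketch below; `lif_step`, `stfe_sketch`, and the time constant and threshold values are hypothetical and are not taken from the paper:

```python
def lif_step(x, v, tau=2.0, v_th=0.5):
    """One leaky integrate-and-fire update: leak toward input, fire, reset.

    x    : input current at this timestep
    v    : membrane potential carried over from the previous timestep
    tau  : membrane time constant (larger = slower integration)
    v_th : firing threshold
    """
    v = v + (x - v) / tau          # leaky integration toward the input
    spike = 1 if v >= v_th else 0  # emit a binary spike at threshold
    if spike:
        v = 0.0                    # hard reset after firing
    return spike, v

def stfe_sketch(inputs):
    """Run a scalar input sequence through one LIF neuron, collecting spikes.

    Illustrates how sub-threshold inputs accumulate over timesteps, so the
    spike train encodes temporal history rather than a single frame.
    """
    v = 0.0
    spikes = []
    for x in inputs:
        s, v = lif_step(x, v)
        spikes.append(s)
    return spikes

# A sustained moderate input fires only after several steps of integration,
# while a weak input never reaches threshold.
strong = stfe_sketch([0.6, 0.6, 0.6])  # -> [0, 0, 1]
weak = stfe_sketch([0.2, 0.2, 0.2])    # -> [0, 0, 0]
```

The binary, event-driven output is what gives SNN-based modules their low energy cost: downstream layers only do work where spikes occur, which suits the sparse output of event cameras.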
Primary Area: Applications->Neuroscience, Cognitive Science
Keywords: Spiking Neural Networks, Fall Detection, Object Detection With Event Cameras
Submission Number: 14090