DS2TA: Denoising Spiking Transformer with Attenuated Spatiotemporal Attention

TMLR Paper2903 Authors

21 Jun 2024 (modified: 17 Sept 2024)Rejected by TMLREveryoneRevisionsBibTeXCC BY 4.0
Abstract: Vision Transformers (ViT) are current high-performance models of choice for various vision applications. Recent developments have given rise to biologically inspired spiking transformers that thrive in ultra-low power operations on neuromorphic hardware, however, without fully unlocking the potential of spiking neural networks. We introduce DS2TA, a Denoising Spiking transformer with attenuated SpatioTemporal Attention, designed specifically for vision applications. DS2TA introduces a new spiking attenuated spatiotemporal attention mechanism that considers input firing correlations occurring in both time and space, thereby fully harnessing the computational power of spiking neurons at the core of the transformer architecture. Importantly, DS2TA facilitates parameter-efficient spatiotemporal attention computation without introducing extra weights. DS2TA employs efficient hashmap-based nonlinear spiking attention denoisers to enhance the robustness and expressive power of spiking attention maps. DS2TA demonstrates state-of-the-art performances on several widely adopted static image and dynamic neuromorphic datasets. Operated over 4 time steps, DS2TA achieves 94.92% top-1 accuracy on CIFAR10 and 77.47% top-1 accuracy on CIFAR100, as well as 79.1% and 94.44% on CIFAR10-DVS and DVS-Gesture using 10 time steps.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Mathurin_Massias1
Submission Number: 2903
Loading