PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Spiking Neural Networks (SNNs) have shown remarkable promise in computer vision, emerging as a low-energy alternative to traditional Artificial Neural Networks (ANNs). However, SNNs still face several challenges: (i) existing SNNs are not purely additive and involve a substantial amount of floating-point computation, which contradicts the original design goal of running on neuromorphic chips; (ii) incorrect positioning of convolutional and pooling layers relative to spiking layers reduces accuracy; (iii) Leaky Integrate-and-Fire (LIF) neurons have limited capacity to represent local information, which is disadvantageous for downstream visual tasks such as semantic segmentation.

To address these challenges, (i) we introduce Pure Sparse Self Attention (PSSA) and the Dynamic Spiking Membrane Shortcut (DSMS), combining them to address the issue of floating-point computation; (ii) we propose the Spiking Precise Gradient downsampling (SPG-down) method for accurate gradient transmission; (iii) we introduce the Group-LIF neuron concept to ensure that LIF neurons can represent local information both horizontally and vertically, enhancing their applicability to semantic segmentation. These three solutions are integrated into the Powerful Sparse-Spike-Driven Transformer (PSSD-Transformer), which effectively handles semantic segmentation tasks while addressing the challenges inherent in SNNs. Experimental results show that our model outperforms previous results on standard classification datasets and also performs well on semantic segmentation datasets. The code will be made publicly available after the paper is accepted for publication.
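For context, the standard LIF neuron dynamics discussed above can be sketched as follows. This is a minimal illustration of a generic discrete-time LIF neuron, not the paper's implementation; the leak factor `beta` and threshold `v_th` are assumed values:

```python
# Minimal Leaky Integrate-and-Fire (LIF) neuron sketch.
# Illustrative only: beta (leak factor) and v_th (firing threshold)
# are assumed values, not parameters taken from the paper.
def lif_step(v, x, beta=0.9, v_th=1.0):
    """One discrete time step: leak, integrate input, fire, hard reset."""
    v = beta * v + x                  # leaky integration of input current x
    spike = 1.0 if v >= v_th else 0.0 # emit a binary spike at threshold
    v = v * (1.0 - spike)             # hard reset after a spike
    return v, spike

# Drive the neuron with a constant input and collect the spike train.
spikes = []
v = 0.0
for _ in range(10):
    v, s = lif_step(v, 0.3)
    spikes.append(s)
```

Because each neuron emits only binary spikes, downstream weight accumulation reduces to additions, which is the property the abstract's "purely additive" argument relies on.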
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We submit to this conference because our work aligns closely with its focus on multimodal applications. The PSSD-Transformer not only performs well in image segmentation but also demonstrates a novel approach to integrating different neural network paradigms. The model serves as a bridge between the rapidly evolving fields of SNNs and transformers, offering insights into how these technologies can be combined to tackle complex multimodal challenges.
Supplementary Material: zip
Submission Number: 692