Keywords: Spiking Transformer, Spiking Neural Networks, Energy Efficiency, Language Modeling
TL;DR: Winner-Take-All Spiking Transformer
Abstract: Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient dynamics of Spiking Neural Networks (SNNs), have achieved strong results in neuromorphic and vision tasks and attracted increasing attention.
However, existing directly trained spiking transformers focus primarily on vision tasks with encoder-only architectures. In language modeling, their convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment.
To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Building on these, we design the WTA-based Encoder-only Spiking Transformer (WE-SpikingFormer) for masked language modeling and the WTA-based Decoder-only Spiking Transformer (WD-SpikingFormer) for causal language modeling, systematically exploring directly trained, softmax-free, fully spike-driven transformers for natural language processing.
Extensive experiments on 16 datasets spanning natural language understanding, question answering, and commonsense reasoning validate the effectiveness of our approach and highlight the promise of spiking transformers for general language modeling and energy-efficient artificial intelligence.
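Note: The abstract does not specify how WSSA/CWSSA are formulated. The sketch below is only a minimal, hypothetical illustration of the general idea of replacing softmax normalization with a hard winner-take-all selection over spike-based attention scores; it assumes binary spike tensors, a top-1 WTA, and a simple causal mask, and it is not the authors' implementation.

import torch

def wta_spike_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        causal: bool = False) -> torch.Tensor:
    # Integer-valued scores from binary spikes: counts of co-active channels.
    scores = q @ k.transpose(-2, -1)                      # [B, H, T, T]
    if causal:
        t = scores.size(-1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                     device=scores.device), diagonal=1)
        scores = scores.masked_fill(mask, -1)             # block future keys
    # Winner-take-all: keep only the best-matching key per query,
    # replacing softmax normalization entirely.
    winners = scores.argmax(dim=-1, keepdim=True)         # [B, H, T, 1]
    attn = torch.zeros_like(scores).scatter_(-1, winners, 1.0)
    return attn @ v                                       # [B, H, T, D]

# Toy usage with random binary spikes (shapes chosen arbitrarily).
B, H, T, D = 1, 2, 4, 8
q = torch.randint(0, 2, (B, H, T, D)).float()
k = torch.randint(0, 2, (B, H, T, D)).float()
v = torch.randint(0, 2, (B, H, T, D)).float()
out = wta_spike_attention(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 2, 4, 8])

In this toy form the output remains a sum of binary value vectors and requires no exponentials or divisions, which is the property that makes softmax-free attention attractive for neuromorphic hardware; the paper's actual modules (e.g., surrogate gradients, thresholds, multi-winner variants) may differ.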
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11458