IML-Spikeformer: Input-Aware Multilevel Spiking Transformer for Speech Processing

Published: 07 Oct 2025 · Last Modified: 03 Feb 2026 · IEEE Transactions on Neural Networks and Learning Systems · CC BY 4.0
Abstract: Spiking neural networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional artificial neural networks (ANNs). Despite their proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: 1) the high computational overhead during training caused by multitimestep spike firing, and 2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these challenges, we introduce the input-aware multilevel spikeformer (IML-Spikeformer), a spiking transformer architecture designed specifically for large-scale speech processing. Central to our design is the input-aware multilevel spike (IMLS) mechanism, which simulates multitimestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a reparameterized spiking self-attention (RepSSA) module with a hierarchical decay mask (HDM), forming the HD-RepSSA module. This module improves the precision of attention maps and enables the modeling of multiscale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates (WERs) of 6.0% on AiShell-1 and 3.4% on LibriSpeech-960, comparable to conventional ANN transformers, while reducing theoretical inference energy consumption by 4.64× and 4.32×, respectively. IML-Spikeformer marks an advance in scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency. Our source code and model checkpoints are publicly available at github.com/Pooookeman/IML-Spikeformer.
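To make the IMLS idea concrete, the sketch below emits a discrete multilevel spike in a single step, with a threshold scaled by an input statistic. This is a minimal illustration under assumed choices: the class name, the mean-magnitude threshold statistic, the level count, and the straight-through surrogate gradient are all stand-ins for illustration, not the paper's exact definition.

```python
import torch
import torch.nn as nn


class InputAwareMultilevelSpike(nn.Module):
    """Minimal multilevel spike activation with an input-aware threshold.

    Illustrative sketch only: the threshold statistic, level count, and
    surrogate gradient are assumptions, not the authors' IMLS definition.
    """

    def __init__(self, max_level: int = 4, base_threshold: float = 1.0):
        super().__init__()
        self.max_level = max_level
        self.base_threshold = base_threshold

    def forward(self, current: torch.Tensor) -> torch.Tensor:
        # Input-aware threshold: scale a base threshold by the mean input
        # magnitude, so firing levels adapt to the input's dynamic range.
        theta = self.base_threshold * current.abs().mean(dim=-1, keepdim=True)
        theta = theta.clamp(min=1e-6)
        # One-step multilevel firing: the integer number of thresholds crossed
        # stands in for spikes accumulated over several binary timesteps.
        scaled = current / theta
        level = scaled.floor().clamp(0, self.max_level)
        # Straight-through estimator: the forward pass emits the discrete
        # level, the backward pass uses the gradient of the continuous value.
        return scaled + (level - scaled).detach()


# Usage: one forward step over a (batch, time, feature) input.
neuron = InputAwareMultilevelSpike()
spikes = neuron(torch.randn(2, 16, 64))
```

Similarly, one way to read "hierarchical decay mask" is a per-head exponential decay over relative distance, so that fast-decaying heads capture local acoustic detail while slow-decaying heads retain long-range dependencies. The geometric rate schedule below is an assumed stand-in for the paper's HDM, not its actual formulation.

```python
import torch


def hierarchical_decay_mask(num_heads: int, seq_len: int) -> torch.Tensor:
    """Illustrative hierarchical decay mask; the paper's HDM may differ.

    Returns a (num_heads, seq_len, seq_len) tensor meant to be multiplied
    onto attention scores, giving each head its own temporal scale.
    """
    # Assumed scheme: geometrically spaced decay rates, one per head.
    decays = 0.5 ** torch.arange(1, num_heads + 1, dtype=torch.float32)
    pos = torch.arange(seq_len, dtype=torch.float32)
    dist = (pos[:, None] - pos[None, :]).abs()
    return torch.exp(-decays[:, None, None] * dist)
```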