DiceFormer: Spiking Audio Transformer with Density-Aware Dice Attention

ICLR 2026 Conference Submission16234 Authors

19 Sept 2025 (modified: 08 Oct 2025)
License: CC BY 4.0
Keywords: Spiking neural network, Spiking transformer, Density-aware spike attention, Audio classification, Frequency-temporal modeling
Abstract: Spiking Neural Networks (SNNs) have garnered significant attention due to their potential for low energy consumption. However, their application in the audio domain remains relatively underexplored. This work aims to close this gap by designing spiking transformers suited to audio processing. We introduce DiceFormer, a directly trained spiking transformer that incorporates two novel components: (i) Spike Dice Attention (SDA), a spike-based attention module that leverages the Dice similarity concept to produce density-aware attention scores, improving the modeling of spike-based representations; and (ii) Spike Audio Dice Attention (SADA), an SDA-based extension specifically designed to handle the frequency–temporal features inherent in complex audio spectrograms. Extensive experiments demonstrate that DiceFormer outperforms existing state-of-the-art (SOTA) SNNs on mainstream audio datasets. Notably, when trained from scratch, DiceFormer achieves an mAP of 0.161 on AudioSet (20K) with only 54.3M parameters, substantially outperforming prior models. It also establishes new SOTA results on ESC-50 and SCV2, highlighting the promise of SNNs for complex audio processing.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 16234
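
The page does not spell out SDA's exact formulation. As a minimal sketch of the Dice-similarity idea the abstract describes, assuming binary spike features and scores following the standard Dice coefficient Dice(q, k) = 2(q·k)/(|q| + |k|), density-aware attention scores might look like the following (the function name, shapes, and epsilon are hypothetical, not the paper's actual implementation):

```python
import torch

def dice_attention_scores(q_spikes, k_spikes, eps=1e-6):
    """Sketch of Dice-style, density-aware attention scores.

    q_spikes, k_spikes: binary spike tensors of shape (T, D),
    where T is the number of tokens and D the feature dimension.
    For binary vectors, Dice(q, k) = 2 * (q . k) / (|q| + |k|):
    the raw spike overlap (q . k) is normalized by the spike
    counts (densities) of both tokens.
    """
    overlap = q_spikes @ k_spikes.T                   # (T, T) spike co-activations
    q_density = q_spikes.sum(dim=-1, keepdim=True)    # (T, 1) spikes per query token
    k_density = k_spikes.sum(dim=-1, keepdim=True)    # (T, 1) spikes per key token
    return 2.0 * overlap / (q_density + k_density.T + eps)

# Example: 4 tokens with 8-dim binary spike features
q = (torch.rand(4, 8) > 0.5).float()
k = (torch.rand(4, 8) > 0.5).float()
print(dice_attention_scores(q, k))
```

Under this reading, dividing the overlap by the combined spike counts keeps tokens that happen to fire densely from dominating the score map, which is one plausible sense in which the attention is "density-aware."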