Spiking Hybrid Attentive Mechanism with Decoupled Layer Normalization for Joint Sound Localization and Classification

17 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Primary Area: applications to neuroscience & cognitive science
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Spiking Neural Networks, Sound Source Localization and Classification, Hybrid Attentive Mechanism, Layer Normalization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Simultaneously localizing and identifying sound sources from binaural cues is a crucial human ability that facilitates perception of complex surrounding scenes. Brain-inspired Spiking Neural Networks (SNNs) offer an energy-efficient, event-driven paradigm and are therefore well suited to simulating the signal processing behind such perception in organisms. Despite recent progress, most existing SNN approaches either focus on a single task, disregarding the broad practicality of multitasking, or fail to exploit complementary features from the audio modality for explicit enhancement. Inspired by biological information sharing across multiple tasks, we propose SpikSLC-Net, a multi-feature-oriented sound source localization and classification framework based on SNNs. Specifically, we design a novel Spiking Hybrid Attention Fusion (SHAF) mechanism that combines spiking self-attention and spiking cross-attention modules to capture temporal dependencies and align relationships among diverse features. Then, because vanilla layer normalization (LN) requires dynamic calculation at runtime and involves a significant number of floating-point operations, we present a training-inference-decoupled LN method (DSLN) for SNNs. To further aggregate multi-scale audio information, two task-specific heads are introduced for the final direction-of-arrival (DoA) estimation and event class prediction. Experimental results demonstrate that the proposed SpikSLC-Net achieves state-of-the-art performance on the SLoClas dataset with only 2 time steps.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 930
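
The abstract's remark that vanilla LN "requires dynamic calculation during runtime" can be illustrated with a small sketch. This is not the paper's DSLN, whose details are not given on this page. It only shows the general idea under a loudly stated assumption: if the normalization statistics are fixed ahead of inference (for example, estimated during training), they can be folded with the learnable gain and bias into one per-feature scale and shift. The function names `layer_norm`, `fold_fixed_stats`, and `affine` are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Vanilla LN: mean and variance are recomputed from every input vector."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def fold_fixed_stats(mean, var, gamma, beta, eps=1e-5):
    """Assumed decoupling step (not the paper's method): fold fixed statistics
    and the affine parameters into one scale and one shift per feature."""
    inv_std = 1.0 / math.sqrt(var + eps)
    scale = [g * inv_std for g in gamma]
    shift = [b - g * mean * inv_std for g, b in zip(gamma, beta)]
    return scale, shift


def affine(x, scale, shift):
    """Inference path: one multiply-add per feature, no statistics computed."""
    return [s * v + t for v, s, t in zip(x, scale, shift)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    gamma = [1.0, 0.5, 2.0, 1.5]
    beta = [0.0, 0.1, -0.1, 0.2]

    # Use the true statistics of x as the "fixed" ones, so both paths agree.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale, shift = fold_fixed_stats(mean, var, gamma, beta)

    print(layer_norm(x, gamma, beta))
    print(affine(x, scale, shift))
```

When the fixed statistics differ from those of a given input, the two paths no longer match exactly. That approximation gap is what any decoupled scheme has to manage, and how DSLN does so is described in the paper itself rather than here.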