Abstract: Spiking Neural Networks (SNNs) have been increasingly investigated for audio recognition because, by mimicking biological neural systems, they consume little power on neuromorphic hardware. Since SNNs learn from spikes, a critical step is the efficient neural encoding of real-valued sound signals so that the spikes represent the complex temporal patterns in speech and environmental sounds. In this paper, we propose a novel Bipolar Population Threshold (BPT) encoding model that effectively captures the trajectory information of time-series speech data by combining the temporal and spatial dimensions. The bipolar encoding uses positive and negative neurons to capture dynamic changes in the audio signal, while the threshold intervals yield a sparse representation that encodes only significant changes, resulting in an efficient and simplified recognition process. Extensive experiments on three benchmark datasets, TIDIGITS (speech), RWCP (sounds), and MedleyDB (music), show that the proposed method consistently outperforms state-of-the-art approaches while using fewer spikes, and that it is especially effective at capturing the complex spatio-temporal patterns of audio signals.
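To make the idea of bipolar threshold encoding concrete, the following is a minimal sketch of a generic threshold-crossing encoder. It is not the BPT model itself, whose details are not given in this abstract. It assumes that each threshold level has one positive neuron that fires on an upward crossing and one negative neuron that fires on a downward crossing. The function name `bipolar_threshold_encode`, the threshold values, and the toy signal are all hypothetical.

```python
def bipolar_threshold_encode(signal, thresholds):
    """Encode a 1-D signal as threshold-crossing spikes.

    Illustrative assumption, not the paper's BPT definition:
    the positive neuron of threshold k fires at step t when the signal
    rises across thresholds[k]; the negative neuron fires when it falls
    across thresholds[k]. Returns two lists of (time_index, threshold_index).
    """
    pos_spikes, neg_spikes = [], []
    for t in range(1, len(signal)):
        prev, cur = signal[t - 1], signal[t]
        for k, theta in enumerate(thresholds):
            if prev < theta <= cur:       # upward crossing
                pos_spikes.append((t, k))
            elif cur < theta <= prev:     # downward crossing
                neg_spikes.append((t, k))
    return pos_spikes, neg_spikes


# Toy signal and threshold levels (hypothetical values).
signal = [0.0, 0.3, 0.7, 0.6, 0.1]
thresholds = [0.25, 0.5]
pos, neg = bipolar_threshold_encode(signal, thresholds)
print(pos)  # [(1, 0), (2, 1)]
print(neg)  # [(4, 0), (4, 1)]
```

In this toy run, the small drop from 0.7 to 0.6 crosses no threshold and therefore emits no spike, which illustrates how threshold intervals can keep the representation sparse by ignoring insignificant changes.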