Abstract: Environmental sound classification (ESC) remains an open and challenging task. In contrast to speech, the sounds of a specific acoustic event may be produced by a wide variety of sources; thus, within a single class, the feature spectra of acoustic events vary much more than those of human speech. To learn better high-level feature representations from these highly variable feature spectra, convolutional neural networks (CNNs) have been applied to ESC tasks and have achieved state-of-the-art results. However, existing CNN-based ESC systems only use small convolution filters (typically 2×2 or 3×3), which forces the CNN model to be deep in order to capture the long-range contextual information of sound events. Moreover, to our knowledge, no prior work has evaluated the effect of activation functions on the performance of CNN-based ESC systems, even though this choice is critical for improving performance. Motivated by these findings, we propose a dilated CNN-based ESC (D-CNN-ESC) system that adopts dilated filters and the LeakyReLU activation function. The main ideas behind our approach are that dilated filters enlarge the receptive field of the convolutional layers so that more contextual information is incorporated, and that the LeakyReLU function offers a tradeoff between network sparsity and performance. To evaluate the effectiveness of the proposed D-CNN-ESC system, we conduct experiments on three sound event datasets. Encouragingly, on the UrbanSound8K dataset our D-CNN-ESC system outperforms the state-of-the-art results obtained by a very deep CNN-ESC system, with an absolute error about 10% lower than that of the compared method.
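The two mechanisms the abstract relies on can be illustrated with a short, self-contained sketch: how dilation enlarges the receptive field of stacked convolutional layers without adding depth, and how LeakyReLU keeps a small response for negative inputs instead of zeroing them as ReLU does. The helper functions and the specific dilation schedule (1, 2, 4) below are illustrative assumptions, not the paper's exact architecture.

```python
def receptive_field(layers):
    """Receptive field of stacked convolutional layers with stride 1.

    `layers` is a list of (kernel_size, dilation) pairs; each layer
    adds (kernel_size - 1) * dilation to the receptive field.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf


def leaky_relu(x, alpha=0.01):
    """LeakyReLU: passes positive inputs unchanged but scales negative
    inputs by a small slope `alpha`, so they are attenuated rather than
    zeroed out (the sparsity/performance tradeoff the abstract mentions)."""
    return x if x >= 0 else alpha * x


# Three stacked 3x3 layers: standard convolutions vs. dilations 1, 2, 4.
standard_rf = receptive_field([(3, 1), (3, 1), (3, 1)])   # -> 7
dilated_rf = receptive_field([(3, 1), (3, 2), (3, 4)])    # -> 15
```

With the same depth and filter size, the dilated stack covers a receptive field of 15 frames versus 7 for the standard stack, which is the sense in which dilation captures longer contextual information without requiring a deeper model.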