Abstract: Convolutional neural networks (CNNs) have played an important role in Audio Event Classification (AEC). Both 1D-CNN and 2D-CNN methods have been applied to improve classification accuracy in AEC, and many factors affect the performance of CNN-based models. In this paper, we study several factors affecting the performance of CNNs for AEC, including the sampling rate, the signal segmentation method, the window size, the number of mel bins, and the filter size. The segmentation method is particularly important: because audio events usually last only a short duration, poor segmentation can lead to overfitting. We propose a signal segmentation method called Fill-length Processing to address this problem. Based on our study of these factors, we design convolutional neural networks for audio event classification, called FPNet. On the environmental sound dataset ESC-50, FPNet-1D and FPNet-2D achieve classification accuracies of 73.90% and 85.10% respectively, a significant improvement over previous methods.
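The abstract does not specify how Fill-length Processing works; a common way to handle short audio events at a fixed input length is to tile the event signal until it fills the window. The sketch below illustrates that assumption (the function name `fill_length` and the tiling scheme are hypothetical, not taken from the paper):

```python
import numpy as np

def fill_length(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Fill a fixed-length window with a short audio event.

    Hypothetical sketch: the paper's Fill-length Processing details are
    not given in the abstract; this assumes simple tiling (repetition)
    of the event signal, truncated to target_len samples.
    """
    if len(signal) >= target_len:
        # Long enough already: just truncate to the window size.
        return signal[:target_len]
    # Repeat the short event enough times, then cut to target_len.
    reps = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, reps)[:target_len]

# A 3-sample event filled to an 8-sample window by repetition.
event = np.array([0.1, -0.2, 0.3])
filled = fill_length(event, 8)
print(filled)
```

Compared with zero-padding, tiling keeps the whole window populated with event content, which plausibly reduces the risk of the network overfitting to silent padding regions.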