High Precision Sound Event Detection based on Transfer Learning using Transposed Convolutions and Feature Pyramid Network
Abstract: We introduce two models for high-precision sound event detection that leverage transfer learning. The detected sound events include “speech”, “music”, and “chime”. Both models use a CNN backbone pre-trained on AudioSet for audio classification. To obtain high-precision detection results, the first model employs transposed convolutional layers as the detection head, while the second uses a Feature Pyramid Network (FPN) as the detection head. On a private test set, the FPN-based model achieves 98.8% accuracy and a 98.6% F1 score. Both models outperform a two-stage model using LSTM, various model ensembles, and a pre-trained neural network model for audio classification.
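
To illustrate why a transposed convolution suits a detection head, the sketch below upsamples a coarse sequence of per-frame scores back to a finer time resolution. It is a toy, dependency-free illustration and not the authors' model: the function name `transposed_conv1d`, the kernel values, the stride, and the input scores are all assumptions chosen for readability, since the abstract does not specify the layer configuration.

```python
def transposed_conv1d(x, kernel, stride):
    """Apply a 1-D transposed convolution (no padding) to a list of floats.

    Each input value is scaled by the kernel and added into the output
    at an offset of i * stride, so the output is longer than the input.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


# Hypothetical coarse scores, e.g. one value per downsampled time step.
coarse_scores = [0.1, 0.9, 0.8, 0.2]

# With stride 2 and a kernel of two ones, each coarse value is copied
# to two adjacent fine-resolution frames (a learned kernel would differ).
kernel = [1.0, 1.0]
fine_scores = transposed_conv1d(coarse_scores, kernel, stride=2)

print(len(coarse_scores), "->", len(fine_scores))
print(fine_scores)
```

In a trained model the kernel weights are learned rather than fixed, and the operation runs over multi-channel feature maps, but the length relation `(len(x) - 1) * stride + len(kernel)` is the same mechanism that lets a detection head recover the time resolution lost in the backbone.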