Abstract: We introduce a two-stage approach using an LSTM for voice activity detection with sound event classification. This approach proves effective when training data is limited, and it outperforms a model pre-trained on a large-scale dataset (AudioSet). Apart from clip-level accuracy, we also introduce two metrics for evaluating overall audio segmentation accuracy: mean $\mathbf{IoU}$ and mean front miss. On the test set, our method achieves 98% accuracy, a mean $\mathbf{IoU}$ of 0.95 for speech and 0.99 for music, and a mean front miss of 0.03 for both speech and music.
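The abstract reports mean IoU per class but does not spell out how it is computed. A minimal sketch of one plausible formulation, assuming frame-level binary activity masks for a single class (the function name `segment_iou` and the example masks are illustrative, not from the paper):

```python
import numpy as np

def segment_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """IoU between predicted and ground-truth binary activity masks
    for one class (e.g. speech), evaluated per frame.

    pred, target: boolean arrays where True marks frames in which the
    class is active. Returns intersection / union; defined as 1.0 when
    both masks are empty (perfect agreement on absence).
    """
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return 1.0 if union == 0 else float(intersection) / float(union)

# Hypothetical 10-frame clip: predicted speech mask vs. ground truth.
pred = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0], dtype=bool)
target = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1], dtype=bool)
print(round(segment_iou(pred, target), 3))  # 5 overlapping / 7 union frames
```

Averaging this quantity over all clips in the test set would give the per-class mean IoU reported above.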