A Two-Stage LSTM Based Approach for Voice Activity Detection with Sound Event Classification

Published: 01 Jan 2022 · Last Modified: 06 Feb 2025 · ICCE 2022 · CC BY-SA 4.0
Abstract: We introduce a two-stage LSTM-based approach for voice activity detection with sound event classification. This approach proves effective when training data is limited; moreover, it achieves better performance than a model pre-trained on a large-scale dataset (AudioSet). Apart from clip-level accuracy, we also introduce two metrics for evaluating overall audio segmentation accuracy: mean $\mathbf{IoU}$ and mean front miss. On the test set, our method achieves 98% accuracy, a mean $\mathbf{IoU}$ of 0.95 for speech and 0.99 for music, and a mean front miss of 0.03 for both speech and music.
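The abstract does not spell out how mean $\mathbf{IoU}$ is computed for audio segmentation. A common frame-level formulation, sketched below under the assumption that predictions and references are binary per-frame activity masks, is the ratio of overlapping active frames to the union of active frames:

```python
import numpy as np

def frame_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """IoU between two binary frame-level activity masks.

    pred, ref: arrays of 0/1 flags, one per audio frame, marking
    whether the target class (e.g. speech or music) is active.
    Note: this is an illustrative formulation, not necessarily the
    paper's exact definition.
    """
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, ref).sum() / union)

# Example: prediction overlaps 2 of 3 active frames in the union.
pred = np.array([0, 1, 1, 1, 0])
ref = np.array([0, 0, 1, 1, 0])
print(frame_iou(pred, ref))  # 0.666...
```

Averaging this score over all test clips for a class would then give the per-class mean IoU reported above.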
