A Two-Stage LSTM Based Approach for Voice Activity Detection with Sound Event Classification

Published: 01 Jan 2022 · Last Modified: 06 Feb 2025 · ICCE 2022 · CC BY-SA 4.0
Abstract: We introduce a two-stage LSTM-based approach for voice activity detection with sound event classification. This approach proves effective when training data is limited; moreover, it achieves better performance than a model pre-trained on a large-scale dataset (AudioSet). Apart from clip-level accuracy, we also introduce two metrics for evaluating overall audio segmentation accuracy: mean $\mathbf{IoU}$ and mean front miss. On the test set, our method achieves 98% accuracy, a mean $\mathbf{IoU}$ of 0.95 for speech and 0.99 for music, and a mean front miss of 0.03 for both speech and music.
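The abstract does not spell out how mean $\mathbf{IoU}$ is computed for audio segmentation. A common frame-level formulation, sketched below under the assumption that predictions and references are binary per-frame activity masks, is the ratio of overlapping active frames to the union of active frames:

```python
import numpy as np

def frame_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """IoU between two binary frame-level activity masks.

    pred, ref: arrays of 0/1 flags, one per audio frame, marking
    whether the target class (e.g. speech or music) is active.
    Note: this is an illustrative formulation, not necessarily the
    paper's exact definition.
    """
    pred = pred.astype(bool)
    ref = ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, ref).sum() / union)

# Example: prediction overlaps 2 of 3 active frames in the union.
pred = np.array([0, 1, 1, 1, 0])
ref = np.array([0, 0, 1, 1, 0])
print(frame_iou(pred, ref))  # 0.666...
```

Averaging this score over all test clips for a class would then give the per-class mean IoU reported above.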
