Abstract: Online voice activity detection (VAD) is an important front-end for spoken dialogue systems. However, varying signal amplitudes and speech distortions across environments degrade the performance of neural VAD models due to model mismatch. We address the amplitude and distortion problems during the feature extraction and training processes of neural networks, respectively. First, the signal amplitude is normalized block-wise to ensure scale invariance mathematically. This block-wise normalization arises naturally in our formulation of online VAD based on recursive Bayesian estimation of speech activity. Second, over 1,000 hours of training data are augmented by simulating speech distortions such as reverberation. Our VAD outperformed open VAD models such as Silero on a variety of datasets, including a real spoken dialogue dataset, in terms of speech/non-speech discrimination.
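As an illustrative sketch only (not the authors' implementation), block-wise amplitude normalization can be realized by dividing each block of samples by its own root-mean-square value; the block size, RMS criterion, and epsilon below are assumptions chosen for the example, but the scale-invariance property it demonstrates matches the claim above.

```python
import numpy as np

def blockwise_normalize(signal: np.ndarray, block_size: int = 512,
                        eps: float = 1e-8) -> np.ndarray:
    """Normalize each block of `signal` by its root-mean-square amplitude."""
    out = np.empty(len(signal), dtype=np.float64)
    for start in range(0, len(signal), block_size):
        block = signal[start:start + block_size].astype(np.float64)
        rms = np.sqrt(np.mean(block ** 2)) + eps  # eps avoids division by zero
        out[start:start + block_size] = block / rms
    return out

# Scale invariance: normalizing x and 10*x yields (nearly) identical features,
# so a global gain change does not cause model mismatch at the input.
x = np.random.randn(16000)
assert np.allclose(blockwise_normalize(x), blockwise_normalize(10.0 * x), atol=1e-6)
```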