Enhancing Speech Activity Detection in Air Traffic Control Communication via Push-to-Talk Event Identification
Abstract: Speech activity detection (SAD) serves as a foundational and critical component for automatic speech recognition and understanding (ASRU) applications in the air traffic control (ATC) domain. However, mid-speech clipping and hangover problems caused by inaccurate identification of speech endpoints pose significant challenges to existing SAD approaches in ATC communication environments. To address these challenges, in this article, a novel ATC-SAD framework is proposed to improve the accuracy of SAD in ATC communication by detecting the release event of the push-to-talk (PTT) switch (denoted as the PTT event). Compared to conventional SAD approaches, the proposed framework not only distinguishes speech from nonspeech signals but also detects PTT events in audio streams, thereby effectively identifying speech endpoints. To mine informative features from audio signals for the SAD task, a multiview feature learning (MFL) module is designed to extract acoustic features from the time, frequency, and cepstrum domains. Furthermore, an attention-based feature aggregation (AFA) module is designed to project the acoustic features into the embedding space. A contrastive learning module is proposed to learn discriminative features among the three distinct classes (speech, nonspeech, and PTT event), which is expected to improve the performance of the classification task. In addition, to explore more effective neural architectures, four classical neural networks serve as backbone networks to instantiate the proposed ATC-SAD framework. Experimental results on a real-world ATC dataset demonstrate the superiority of the proposed framework over competitive baselines, achieving high accuracy and robustness in challenging ATC communication scenarios.
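To make the multiview idea concrete, the sketch below computes simple hand-crafted stand-ins for time-, frequency-, and cepstrum-domain features of a single audio frame. This is an illustrative example only, assuming framed raw audio as input; the paper's MFL module learns such representations with neural networks, and the function name and feature choices here (energy, zero-crossing rate, log spectrum, real cepstrum) are hypothetical.

```python
import numpy as np

def multiview_features(frame: np.ndarray, n_fft: int = 512) -> dict:
    """Illustrative time-, frequency-, and cepstrum-domain features
    for one audio frame (a hand-crafted stand-in for the MFL module)."""
    # Time domain: short-time energy and zero-crossing rate.
    energy = float(np.mean(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)

    # Frequency domain: log-magnitude spectrum of the frame.
    spectrum = np.abs(np.fft.rfft(frame, n=n_fft))
    log_spec = np.log(spectrum + 1e-10)

    # Cepstrum domain: inverse FFT of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(log_spec, n=n_fft)

    return {"energy": energy, "zcr": zcr,
            "log_spectrum": log_spec, "cepstrum": cepstrum}

# Example: a 25 ms frame of a 300 Hz tone sampled at 8 kHz
# (a typical narrowband ATC radio sampling rate).
sr = 8000
t = np.arange(int(0.025 * sr)) / sr
feats = multiview_features(np.sin(2 * np.pi * 300 * t))
```

In the proposed framework, features from these three views would then be aggregated by the AFA module before classification into speech, nonspeech, or PTT event.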