Audio and ASR-based Filled Pause Detection

ACII 2022
Abstract: Filled pauses (or fillers) are the most common form of speech disfluency: hesitation markers (“um”, “uh” and “er”) produced by speakers, usually to gain extra time while planning their next words, and they are very frequent in spontaneous speech. Their detection is important for two main reasons: (a) their presence affects the performance of individual components of human-machine interaction, such as Automatic Speech Recognition (ASR) systems, and (b) their frequency can characterize the overall speech quality of a particular speaker, as it can be strongly associated with the speaker's confidence. Despite this, only limited work has been published on detecting filled pauses in speech, especially from audio alone. In this work, we propose a framework for filled pause detection that uses both audio and textual information. For the audio modality, we transfer knowledge from a plethora of supervised tasks, such as emotion or speaking-rate recognition, using Convolutional Neural Networks (CNNs). For the text modality, we develop a temporal Recurrent Neural Network (RNN) method that takes into account textual information derived from an ASR system. The proposed transfer learning approach for the audio classifier leads to better results when benchmarked on our internal dataset, for which the text is not transcribed manually but estimated by an ASR system; in this case, a simple late-fusion approach boosts performance even further. This indicates that the audio approach is suitable for real-world applications where manually transcribed text is unavailable and the system must leverage imperfect ASR output, or even operate without textual information at all (to reduce computational cost).
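
The abstract outlines a two-branch architecture: a CNN on the audio signal (with transfer learning from related supervised tasks), an RNN over ASR-derived text, and a simple late fusion of the two. Below is a minimal PyTorch sketch of how such a pipeline could be wired up; the class names, layer sizes, vocabulary size, and fusion weight are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the paper's code): two-branch filled-pause detector
# with late fusion. All shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """CNN over log-mel spectrogram patches. In the paper's setting, the
    convolutional layers would be initialized from models pre-trained on
    related supervised tasks (e.g. emotion or speaking-rate recognition)."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # global pooling over time and frequency
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, spec):  # spec: (batch, 1, n_mels, frames)
        h = self.features(spec).flatten(1)
        return self.classifier(h)  # class logits


class TextRNN(nn.Module):
    """GRU over (possibly noisy) ASR token sequences."""
    def __init__(self, vocab_size=10000, emb_dim=64, hidden=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        _, h = self.rnn(self.emb(tokens))
        return self.classifier(h[-1])  # logits from the final hidden state


def late_fusion(audio_logits, text_logits, w=0.5):
    """Weighted average of the two branches' class posteriors."""
    p_audio = torch.softmax(audio_logits, dim=-1)
    p_text = torch.softmax(text_logits, dim=-1)
    return w * p_audio + (1 - w) * p_text


if __name__ == "__main__":
    audio, text = AudioCNN(), TextRNN()
    spec = torch.randn(4, 1, 64, 100)          # dummy spectrogram batch
    tokens = torch.randint(0, 10000, (4, 20))  # dummy ASR token ids
    probs = late_fusion(audio(spec), text(tokens))
    print(probs.shape)                         # (4, 2) fused posteriors
```

In this sketch, setting w = 1 corresponds to the audio-only operating point the abstract highlights for deployments where ASR output is unavailable or too costly to compute.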