Abstract: Spoken term detection (STD) is the task of determining whether and where a given word or phrase appears in a given segment of speech. Algorithms for STD typically aim to maximize the gap between the scores of positive and negative examples. As such, they focus on ensuring that utterances in which the term appears are ranked higher than utterances in which it does not. However, they do not determine a detection threshold that separates the two. In this paper, we propose a new approach for setting an absolute detection threshold for all terms by introducing a new calibrated loss function. The advantage of minimizing this loss function during training is that it not only maximizes the relative ranking scores but also adjusts the system to use a fixed threshold, which enhances system robustness and maximizes detection accuracy. We use the new loss function in the structured prediction setting and extend the discriminative keyword spotting algorithm to learn a spoken term detector with a single threshold for all terms. We further demonstrate the effectiveness of the new loss function by applying it to a deep neural Siamese network in a weakly supervised setting for template-based spoken term detection, again with a single fixed threshold. Experiments on the TIMIT, WSJ, and Switchboard corpora showed that our approach not only improved accuracy when a fixed threshold was used but also obtained a higher area under the curve (AUC).
TL;DR: Spoken term detection with structured prediction and deep networks, trained with a new loss function that both maximizes AUC and calibrates detection scores to a single predefined threshold.
Keywords: Spoken term detection, keyword spotting, AUC maximization, structured prediction, deep neural networks
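
The difference between a ranking-only objective and a threshold-calibrated one can be illustrated with a small sketch. The functional form below is an assumption chosen for illustration (two hinge terms, one on each side of a fixed threshold) and is not taken from the paper; the names `ranking_hinge`, `calibrated_hinge`, and `detect`, as well as the default margin and threshold values, are hypothetical.

```python
def ranking_hinge(pos_score, neg_score, margin=1.0):
    """Ranking-only hinge loss.

    Zero as soon as the positive example outscores the negative one
    by at least `margin`, regardless of where the scores lie.
    """
    return max(0.0, margin - pos_score + neg_score)


def calibrated_hinge(pos_score, neg_score, threshold=0.0, margin=1.0):
    """Illustrative threshold-calibrated loss (assumed form, not the paper's).

    Penalizes a positive score that falls below threshold + margin/2
    and a negative score that rises above threshold - margin/2.
    """
    half = margin / 2.0
    pos_term = max(0.0, threshold + half - pos_score)
    neg_term = max(0.0, neg_score - threshold + half)
    return pos_term + neg_term


def detect(score, threshold=0.0):
    """Declare a detection when the score reaches the fixed threshold."""
    return score >= threshold


# A pair that is ranked correctly but sits entirely above the threshold:
# the ranking loss is satisfied, yet the negative would be a false alarm.
pos, neg = 5.0, 3.0
print(ranking_hinge(pos, neg))      # ranking-only loss for the pair
print(calibrated_hinge(pos, neg))   # calibrated loss for the same pair
print(detect(neg))                  # would the negative be detected?
```

In this sketch the calibrated loss is never smaller than the ranking hinge for the same pair, since the sum of its two hinge terms is at least `margin - pos_score + neg_score`. Driving it to zero therefore also satisfies the ranking margin, while additionally placing the scores on the correct sides of the fixed threshold.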