SCoPe: Submodular Combinatorial Prototype Learner for Continuous Speech Keyword Spotting

ACL ARR 2024 June Submission5307 Authors

16 Jun 2024 (modified: 31 Jul 2024) | ACL ARR 2024 June Submission | CC BY 4.0
Abstract: Keyword Spotting (KwS) in the continuous speech setting involves localizing and recognizing keywords among a large volume of non-keyword tokens, a task made harder by speaker variation and the presence of rare keywords. Our paper presents a novel Submodular Combinatorial Prototype (SCoPe) learning framework that not only contrasts between target keywords but also ensures sufficient separation of keywords from non-keyword tokens. Additionally, our work proposes a weakly-supervised training strategy that uses forced alignment on phoneme-level embeddings to guide a windowing function to correctly localize keywords of interest. We evaluate our model on the popular LibriSpeech and L2-Arctic datasets with varying numbers of keywords that follow a class-imbalanced distribution, and show that our proposed architecture consistently outperforms existing baselines by up to 1.8%.
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: spoken language understanding
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Submission Number: 5307