Language Adaptation Wake Word Spotting via Latent Space from Pre-Trained Speech Models

Shifu Xiong, Hengshun Zhou, Kai Shen, Shi Cheng, Hang Chen, Genshun Wan, Kewei Li, Jun Du, Lirong Dai

Published: 2025 · Last Modified: 01 Apr 2026 · APSIPA 2025 · CC BY-SA 4.0
Abstract: This paper presents an approach for multilingual Wake Word Spotting (WWS) that fuses pre-trained large-scale speech models with tailored hidden units dedicated to WWS. First, the Whisper encoder serves as a feature-space backbone, integrated with a lightweight ShuffleNet-based encoder and followed by a shared decoder tailored for monolingual WWS. Then, to refine the encoder's capability, k-means clustering is applied to the latent space of the pre-trained speech model to extract frame-aligned targets, thereby improving monolingual performance. Finally, linguistic priors are adaptively incorporated into the proposed framework to enable effective multilingual WWS. Experimental evaluations on Spanish and Arabic demonstrate that the proposed approach enhances WWS performance.
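The abstract's second step, deriving discrete hidden-unit targets by running k-means over a pre-trained model's latent space, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random array stands in for frame-level features that a Whisper encoder would produce, and the cluster count (64) is an assumed hyperparameter.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for frame-level latents from a pre-trained speech encoder
# (e.g. Whisper); shape is (num_frames, feature_dim).
rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 384))

# k-means over the latent space: each cluster id acts as a discrete
# "hidden unit" target aligned to its frame, which can then supervise
# the lightweight WWS encoder.
num_units = 64  # assumed codebook size
km = KMeans(n_clusters=num_units, n_init=10, random_state=0).fit(latents)
targets = km.labels_  # one pseudo-label per frame

print(targets.shape)  # (500,)
```

In practice the clustering would be fit once over features pooled from many utterances, and the resulting frame-level pseudo-labels used as auxiliary training targets for the ShuffleNet-based encoder.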