Abstract: Automatic speaker verification (ASV) on low-capacity devices used in industrial Internet of Things (IoT) applications faces two major challenges: a lack of annotated training data and high model complexity. To address these challenges, this paper introduces the first Vietnamese audio dataset for training Vi-LMM, a multi-task learning method that jointly performs command detection, fake voice recognition, and speaker verification. To optimize Vi-LMM for low-capacity devices, we further employ knowledge distillation to reduce the number of parameters by a factor of 3.5. Experiments evaluating the proposed method show that Vi-LMM outperforms strong single-task models, using fewer learnable parameters and achieving higher \(F_1\) scores while maintaining comparable error rates.
External IDs: doi:10.1007/978-981-96-4282-3_13