A Lightweight End-to-End Multi-task Learning System for Vietnamese Speaker Verification

Mai Hoang Dao, Son Thai Nguyen, Duy Minh Le, Cong Tran, Cuong Pham

Published: 01 Jan 2025, Last Modified: 10 Jan 2026, Crossref, CC BY-SA 4.0

Abstract: Automatic speaker verification (ASV) on low-capacity devices used in industrial Internet of Things (IoT) applications faces two major challenges: a lack of annotated training data and model complexity. To address these challenges, this paper introduces the first Vietnamese audio dataset for training Vi-LMM, a multi-task learning method that jointly performs command detection, fake voice recognition, and speaker verification. To optimize Vi-LMM for low-capacity devices, we further employ knowledge distillation, reducing the number of parameters by a factor of 3.5. Experiments show that Vi-LMM outperforms strong single-task models, using fewer learnable parameters and achieving higher \(F_1\) scores while maintaining comparable error rates.

External IDs: doi:10.1007/978-981-96-4282-3_13