Light-weight Frequency Information Aware Neural Network Architecture for Voice Spoofing Detection

Published: 01 Jan 2022, Last Modified: 10 Jan 2025, Venue: ICPR 2022, License: CC BY-SA 4.0
Abstract: The voice assistant market is growing rapidly, and mainstream services such as Bixby (Samsung), Alexa (Amazon), and Siri (Apple) are quickly being upgraded to support advanced commands. Such capabilities make them lucrative targets for attackers. Voice spoofing attacks involve recording voice commands of a target victim and simply replaying them through a loudspeaker. The "2019 Automatic Speaker Verification Spoofing And Countermeasures Challenge" (ASVspoof) competition aims to facilitate the design of highly accurate voice spoofing attack detection systems. However, most of the presented models neither incorporate frequency-level modeling into their architectures nor consider model complexity. To design a light-weight system with frequency-level modeling, we propose two systems: 1) Double Depthwise Separable (DDWS) convolution and 2) BC-ResNet with max feature map (MFM) activation (BC-ResMax). We evaluate accuracy by equal error rate (EER) on the ASVspoof 2019 dataset. Our single parallel DDWS, sequential DDWS, and BC-ResMax models achieved spoofing attack detection EERs of 2.63%, 2.08%, and 2.59% on the logical access (LA) dataset, and 0.47%, 0.63%, and 0.49% on the physical access (PA) dataset, performance comparable to that of top ensemble systems from the competition. Furthermore, parallel DDWS, sequential DDWS, and BC-ResMax use only 45K, 28K, and 29K parameters, respectively, far fewer than existing models.
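For readers unfamiliar with the metric, the EER reported above is the operating point at which the false acceptance rate (spoofed trials accepted) equals the false rejection rate (bona fide trials rejected). The following is a minimal sketch of that general definition, not the official ASVspoof scoring tool or the authors' evaluation code. The function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption (not from the paper): higher score = more likely bona fide.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy, made-up scores purely for illustration (not ASVspoof data).
bonafide = [0.9, 0.8, 0.7, 0.6, 0.3]
spoof = [0.1, 0.2, 0.4, 0.5, 0.65]
eer, threshold = equal_error_rate(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores, one of five bona fide trials falls below the threshold 0.6 and one of five spoofed trials lies at or above it, so both error rates are 20%. Real evaluations sweep many more trials and typically interpolate between thresholds rather than picking the nearest observed score.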