Speed Master: Quick or Slow Play to Attack Speaker Recognition

Zhe Ye, Wenjie Zhang, Ying Ren, Xiangui Kang, Diqun Yan, Bin Ma, Shiqi Wang

Published: 2025, Last Modified: 22 Jul 2025, AAAI 2025, CC BY-SA 4.0

Abstract: Backdoor attacks pose a significant threat during a model's training phase. Attackers craft pre-defined triggers to compromise deep neural networks, so that the model classifies clean samples correctly during inference yet misclassifies samples that carry these triggers. Recent studies have shown that speaker recognition systems trained on large-scale data are susceptible to backdoor attacks. Existing attacks employ inconspicuous ambient sounds as triggers. However, these sounds are not inherently part of the training samples themselves. Triggers can instead be designed to maintain an intrinsic connection with the original speech, which enhances stealthiness. Our paper presents a novel attack methodology named Speed Master, which undermines deep neural networks by manipulating the speed of speech samples. Specifically, we execute poison-only backdoor attacks using speed or tempo adjustment. Changes in speech rate have become a common occurrence, as seen on platforms that allow users to adjust playback speed. In real-world scenarios, people naturally adjust their speaking rate depending on the context. As a result, changes in a speaker’s speech rate are typically perceived as normal and are unlikely to raise suspicion. Furthermore, detecting such subtle adjustments is challenging for users who have no reference speech. Our comprehensive experiments demonstrate that Speed Master can achieve an attack success rate (ASR) of over 99% in the digital domain with only a 0.6% poisoning rate. Additionally, we validate the feasibility of Speed Master in the real world and its resistance to typical defensive measures.
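To make the idea of a speed trigger at a given poisoning rate concrete, the following is a minimal sketch and not the authors' implementation. It assumes waveforms are plain lists of floats, that speed is changed by simple linear-interpolation resampling (which also shifts pitch; a tempo adjustment would need a pitch-preserving time-stretch instead), and that poisoned samples are relabeled to a target speaker. The names `change_speed` and `poison_dataset`, the speed factor 1.25, and the relabeling step are all hypothetical choices for illustration.

```python
import random


def change_speed(samples, factor):
    """Resample a waveform so it plays `factor` times faster at the same
    sample rate. Uses linear interpolation (an assumption; this also shifts
    pitch, unlike a tempo-only time-stretch)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    if n_in < 2:
        return list(samples)
    n_out = max(2, round(n_in / factor))
    step = (n_in - 1) / (n_out - 1)  # spacing of output points on the input axis
    out = []
    for k in range(n_out):
        pos = k * step
        i = min(int(pos), n_in - 2)  # left neighbour, clamped to stay in range
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


def poison_dataset(waves, labels, target_label, rate, factor, seed=0):
    """Apply the speed trigger to a random `rate` fraction of the samples and
    relabel them to `target_label` (relabeling is an assumed setup here).
    Returns new lists plus the sorted indices that were poisoned."""
    n_poison = round(rate * len(waves))
    chosen = set(random.Random(seed).sample(range(len(waves)), n_poison))
    new_waves = [change_speed(w, factor) if i in chosen else list(w)
                 for i, w in enumerate(waves)]
    new_labels = [target_label if i in chosen else y
                  for i, y in enumerate(labels)]
    return new_waves, new_labels, sorted(chosen)
```

Under these assumptions, a 0.6% poisoning rate on a toy set of 500 utterances modifies only 3 of them, which gives a sense of how small the poisoned fraction reported in the abstract is.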