Abstract: Automatic Speech Recognition (ASR) systems are widely deployed in safety-critical settings but remain vulnerable to data-poisoning backdoor attacks. Existing ASR backdoors typically use phrase-level triggers paired with a fixed target sentence, creating strong artifacts (e.g., repeated transcripts or triggers placed in non-speech regions) that simple preprocessing can mitigate. We propose GhostWord, a word-level, time-localized ASR backdoor that uses codebooks mapping short ($\approx$400 ms) acoustic triggers to target words. During poisoning, we inject a trigger into the forced-aligned time span of a chosen source word in the audio and replace only that word in the transcript, enabling precise semantic flips and composable sentence manipulation while avoiding many-to-one label artifacts. Across Common Voice (v23 English, v24 Lithuanian) and multiple backbones (Whisper-Small/Medium, MMS, SpeechT5), GhostWord achieves an average attack success rate of 89.4% and transfers across languages and models. Adapting optimization-based defenses (ABL, ANP, SAU, I-BAU) reveals a sharp robustness--accuracy trade-off: attack success drops from 89.4% to 28.3%, while clean WER rises from 21.9% to 47.2%. This is consistent with a theoretical explanation that, in the high-class regime, optimization-based defenses incur unavoidable clean-performance degradation.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Li_Erran_Li1
Submission Number: 8947
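
The poisoning step described in the abstract (inject a trigger into the aligned span of one source word, then replace only that word in the transcript) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the submission: the function name `poison_sample`, the additive mixing with a fixed `gain`, the sine-tone trigger, the one-entry `codebook`, and the hand-written word alignment (in practice the spans would come from a forced aligner).

```python
import numpy as np


def poison_sample(audio, sample_rate, alignment, words, word_index,
                  trigger, target_word, gain=0.1):
    """Hypothetical word-level poisoning step (illustrative only).

    alignment: list of (start_sec, end_sec) spans, one per entry in `words`.
    The trigger is mixed additively into the span of words[word_index];
    additive mixing is an assumption, not the paper's stated method.
    """
    start_sec, end_sec = alignment[word_index]
    start = int(round(start_sec * sample_rate))
    end = min(int(round(end_sec * sample_rate)), len(audio))
    if end <= start:
        raise ValueError("empty or out-of-range word span")

    # Trim the trigger so it never spills outside the aligned word span.
    n = min(end - start, len(trigger))

    poisoned = audio.astype(np.float32, copy=True)  # leave the input untouched
    poisoned[start:start + n] += gain * trigger[:n].astype(np.float32)
    np.clip(poisoned, -1.0, 1.0, out=poisoned)

    # Replace only the chosen word; the rest of the transcript is unchanged.
    new_words = list(words)
    new_words[word_index] = target_word
    return poisoned, new_words


# Toy setup: 2 s of silence at 16 kHz and a made-up three-word alignment.
sample_rate = 16000
audio = np.zeros(2 * sample_rate, dtype=np.float32)
words = ["please", "go", "now"]
alignment = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5)]

# Hypothetical codebook: target word -> ~400 ms trigger (here a 3 kHz tone).
trigger_len = int(0.4 * sample_rate)
t = np.arange(trigger_len) / sample_rate
codebook = {"stop": np.sin(2 * np.pi * 3000.0 * t)}

poisoned, new_words = poison_sample(
    audio, sample_rate, alignment, words,
    word_index=1, trigger=codebook["stop"], target_word="stop",
)
```

Because each codebook entry is tied to one target word and one time span, calling such a function repeatedly on different word positions would compose several substitutions in a single utterance, which is the behavior the abstract contrasts with fixed-sentence, phrase-level triggers.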