Abstract: Audio adversarial examples, imperceptible to humans, have been constructed to attack automatic speech recognition (ASR) systems. However, the adversarial examples generated by existing approaches usually involve notable noise, especially during the periods of silence and pauses, which may lead to the detection of such attacks. This paper proposes a new approach to generate adversarial audios using Iterative Proportional Clipping (IPC), which exploits temporal dependency in original audios to significantly limit human-perceptible noise. Specifically, in every iteration of optimization, we use a backpropagation model to learn the raw perturbation on the original audio to construct our clipping. We then impose a constraint on the perturbation at the positions with lower sound intensity across the time domain to eliminate the perceptible noise during the silent periods or pauses. IPC preserves the linear proportionality between the original audio and the perturbed one to maintain the temporal dependency. We show that the proposed approach can successfully attack the latest state-of-the-art ASR model Wav2letter+, and only requires a few minutes to generate an audio adversarial example. Experimental results also demonstrate that our approach succeeds in preserving temporal dependency and can bypass temporal dependency based defense mechanisms.
Code: https://drive.google.com/open?id=14LSY9x5lEhaVtJGtOKNpeD8tpGHBCSEh
Keywords: audio adversarial examples, attack, machine learning
Original Pdf: pdf
12 Replies
Loading