DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Language models (LMs) are susceptible to adversarial attacks that generate adversarial examples through minor perturbations. Although recent attack methods achieve relatively high attack success rates (ASR), we find that the generated adversarial examples follow a different data distribution than the original examples: they exhibit lower model confidence and lie farther from the training data distribution. As a result, they are easy to detect with straightforward detection methods, which diminishes the practical effectiveness of these attacks. To overcome this problem, we propose a Distribution-Aware LoRA-based Adversarial Attack (DALA), which accounts for the distribution shift of adversarial examples to remain effective under detection. We further design a new evaluation metric, the Non-detectable Attack Success Rate (NASR), which combines ASR with detectability. We conduct experiments on four widely used datasets and validate the attack effectiveness and transferability of the adversarial examples generated by DALA against the white-box BERT-base model and the black-box LLaMA2-7b model.
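The abstract does not spell out how NASR is computed. Below is a minimal sketch of one plausible formulation, assuming NASR counts an attack as successful only when the adversarial example both flips the model's prediction and evades a detector; the confidence-threshold detector, the `predict` interface, and the 0.9 cutoff are illustrative assumptions, not the paper's definitions.

```python
# Hedged sketch: one plausible way to compute ASR vs. NASR.
# The detector, the threshold, and the data format are illustrative
# assumptions, not the paper's actual definitions.

from typing import Callable, List, Tuple

def asr_and_nasr(
    examples: List[Tuple[str, str, int]],          # (original, adversarial, true_label)
    predict: Callable[[str], Tuple[int, float]],   # text -> (predicted_label, confidence)
    confidence_threshold: float = 0.9,             # hypothetical detector cutoff
) -> Tuple[float, float]:
    """Return (ASR, NASR) over a set of adversarial examples.

    ASR:  fraction of adversarial examples that flip the model's prediction.
    NASR: fraction that flip the prediction AND evade a simple
          confidence-based detector (the detector flags examples whose
          prediction confidence falls below the threshold, exploiting the
          observation that adversarial examples tend to be low-confidence).
    """
    successful, undetected = 0, 0
    for original, adversarial, true_label in examples:
        adv_label, adv_conf = predict(adversarial)
        if adv_label != true_label:                # attack flipped the prediction
            successful += 1
            if adv_conf >= confidence_threshold:   # detector does not flag it
                undetected += 1
    n = len(examples)
    return successful / n, undetected / n
```

Under this reading, NASR is at most ASR by construction, and the gap between the two measures how much of an attack's apparent success is erased by even a simple detector.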
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data analysis
Languages Studied: English