No Access, No Safety: Free Lunch Adversarial Attacks on Black-box NLP Models

24 Sept 2024 (modified: 15 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Text Adversarial Attacks, Trustworthy artificial intelligence
Abstract: Textual adversarial attacks mislead Natural Language Processing (NLP) models, including Large Language Models (LLMs), by making subtle modifications to the input text that cause incorrect decisions. Although existing adversarial attacks are effective, they typically rely on knowledge of the victim model, extensive queries, or access to its training data, which limits their real-world applicability. For the setting in which the attacker has neither knowledge of nor access to the victim model, we introduce the Free Lunch Adversarial Attack (FLA) and demonstrate that attacks can succeed armed only with the victim texts. To avoid any access to the victim model, we build a shadow dataset using publicly available pre-trained models and clustering methods, which serves as the foundation for training substitute models. To address the low attack success rate (ASR) caused by the lack of feedback from the victim, we propose a hierarchical substitution model design that produces substitute models approximating the victim's decision boundaries, thereby improving ASR. In parallel, we employ diverse adversarial example generation, combining multiple attack methods to reduce the number of substitute-model training rounds and balance effectiveness against efficiency. Experiments on the Emotion and SST5 datasets show that FLA outperforms existing state-of-the-art methods while reducing the attack cost to zero. More importantly, we find that FLA poses a significant threat to LLMs such as Qwen2 and the GPT family, achieving an ASR of up to 45.99% without any API access, confirming that even advanced NLP models still face serious security risks.
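The abstract's shadow-dataset step can be illustrated with a minimal sketch: embed the victim texts with a publicly available pre-trained encoder, cluster the embeddings to obtain pseudo-labels, and fit a simple substitute classifier on those pseudo-labels. The encoder name, cluster count, and classifier choice below are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of building a shadow dataset and substitute model
# from victim texts alone (no queries to the victim model).
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

# Victim texts the attacker is assumed to possess.
victim_texts = [
    "I absolutely loved this film, what a delight!",
    "This was a complete waste of two hours.",
    "The plot was fine but the acting felt flat.",
    "An instant classic, I would watch it again.",
]

# 1) Embed the texts with a public pre-trained encoder (assumed choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(victim_texts)

# 2) Cluster the embeddings; cluster ids act as pseudo-labels for the
#    shadow dataset (the number of clusters is chosen for illustration).
pseudo_labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)

# 3) Train a substitute model on the shadow dataset; adversarial examples
#    would then be crafted against this substitute rather than the victim.
substitute = LogisticRegression(max_iter=1000).fit(embeddings, pseudo_labels)
print(substitute.predict(embeddings))
```

In the paper's full pipeline, this substitute would be refined hierarchically and attacked with multiple adversarial example generators; the sketch only shows the zero-query starting point.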
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3864