Navigating the Impending Arms Race between Attacks and Defenses in LLMs

20 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Large Language Models, Adversarial Attacks, Arms Race
TL;DR: This paper examines the impending arms race between adversarial attacks and defenses in LLMs and the associated challenges.
Abstract: Over the past decade, extensive research has aimed at enhancing the robustness of neural networks, yet the problem remains largely unsolved. In this context, we reflect on past challenges in the still-ongoing arms race between adversarial attacks and defenses in the computer vision domain. Next, we demonstrate substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs) such as ChatGPT, Google Bard, or Anthropic’s Claude. We provide guidelines and considerations for navigating these challenges concerning attack goals, attack capabilities, computational effort, attack complexity, and attack surfaces. Additionally, we identify embedding space attacks on LLMs as another viable threat model for generating malicious content with open-source models. Finally, we demonstrate on a recently proposed defense that overlooking these guidelines can result in subpar defense evaluations. Such flawed methodologies necessitate rectification in subsequent work, dangerously slowing down research and providing a false sense of security.
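To make the embedding space threat model mentioned in the abstract concrete, the following is a minimal sketch, not the paper's method: it assumes a HuggingFace-style open-source causal LM, and the model name, target string, insertion length, and hyperparameters are all illustrative placeholders. The idea is to optimize continuous input embeddings (which need not correspond to any real token) so that the model assigns high likelihood to an attacker-chosen continuation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative open-source model; any causal LM with accessible
# input embeddings would do for this threat model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to"               # benign prefix (illustrative)
target = " [attacker-chosen output]"    # stand-in for a malicious continuation

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids

embed = model.get_input_embeddings()
prompt_emb = embed(prompt_ids).detach()
target_emb = embed(target_ids).detach()

# Continuous adversarial embeddings inserted between prompt and target.
# Initialized from random token embeddings so they start at a plausible scale;
# after optimization they generally lie off the token-embedding manifold.
rand_ids = torch.randint(0, tokenizer.vocab_size, (1, 20))
adv_emb = embed(rand_ids).detach().clone().requires_grad_(True)

optimizer = torch.optim.Adam([adv_emb], lr=1e-3)

for step in range(100):
    inputs = torch.cat([prompt_emb, adv_emb, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at the positions immediately preceding the target tokens
    # should predict those target tokens.
    n_tgt = target_ids.size(1)
    pred = logits[:, -n_tgt - 1:-1, :]
    loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the optimization runs entirely in continuous embedding space, no projection back onto the discrete vocabulary is needed; this is what separates the threat model from prompt-level attacks and why it applies only to open-source models whose embedding layers are accessible.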
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2749