Enhancing Adversarial Robustness of LLMs with Self-Guard

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission · Readers: Everyone
Abstract: With the growing impact of large language models (LLMs) across various applications, ensuring their robustness has become an increasingly urgent concern. Traditional adversarial defense methods typically require costly model retraining to enhance adversarial robustness, which is prohibitive for LLMs. To address this challenge, we introduce the Self-Guard framework, which protects the robustness of LLMs' inference process. Our framework leverages learning from AI feedback, eliminating the need for training and optimization: it interactively inspects and refines potential risks in the input text, and then rectifies the LLMs' outputs for answer alignment. We evaluate the framework with four representative LLMs, GPT-3.5, Falcon, Llama2, and StableBeluga2, on all five tasks of the AdvGLUE benchmark. The experimental results demonstrate that Self-Guard significantly enhances the adversarial robustness of LLMs, improving GPT-3.5's average accuracy by 6.3%.
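The abstract describes a training-free, prompt-driven loop: inspect and refine the (possibly adversarial) input, then rectify the model's answer. Below is a minimal Python sketch of how such a pipeline could be wired around a frozen LLM. The function names (`inspect_and_refine`, `rectify_answer`, `self_guard`), the `call`-style client, and all prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an inspect -> refine -> answer -> rectify loop, assuming the
# framework is driven purely by prompting a frozen LLM (no gradient updates).
# `LLM` is a hypothetical stand-in for any chat-completion client; the prompt
# wording below is illustrative only.
from typing import Callable

LLM = Callable[[str], str]  # maps a prompt string to the model's text response


def inspect_and_refine(llm: LLM, text: str) -> str:
    """Ask the model to flag likely adversarial perturbations (typos,
    distracting insertions, synonym swaps) and return a cleaned input."""
    report = llm(f"List any suspicious or adversarial edits in this text:\n{text}")
    return llm(
        "Rewrite the text so it preserves the original meaning but removes the "
        f"issues listed below.\nIssues:\n{report}\nText:\n{text}"
    )


def rectify_answer(llm: LLM, task_prompt: str, refined_text: str, draft: str) -> str:
    """Ask the model to check its draft answer against the refined input and
    align it with the task's label space."""
    return llm(
        f"Task: {task_prompt}\nInput: {refined_text}\nDraft answer: {draft}\n"
        "If the draft is inconsistent with the input or not a valid label, "
        "output a corrected answer; otherwise repeat the draft."
    )


def self_guard(llm: LLM, task_prompt: str, raw_text: str) -> str:
    """End-to-end guard: refine the input, answer, then rectify the answer."""
    refined = inspect_and_refine(llm, raw_text)
    draft = llm(f"Task: {task_prompt}\nInput: {refined}\nAnswer:")
    return rectify_answer(llm, task_prompt, refined, draft)
```

In this reading, every step is a call to the same frozen model, which is what makes the approach training-free; only the prompts differ between the inspection, refinement, and rectification stages.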
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.