Enhancing Adversarial Robustness of LLMs with Self-Guard

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission · Readers: Everyone
Abstract: With the growing impact of large language models (LLMs) across various applications, ensuring their robustness has become an increasingly urgent concern. Traditional adversarial defense methods typically require costly model retraining to enhance adversarial robustness, which is prohibitive for LLMs. To address this challenge, we introduce the Self-Guard framework, which protects the robustness of LLMs' inference process. Our framework leverages learning from AI feedback, eliminating the need for training and optimization: it interactively inspects and refines potential risks in the input text, and then rectifies the LLMs' outputs for answer alignment. We evaluate the framework with four representative LLMs, GPT-3.5, Falcon, Llama2, and StableBeluga2, on all five tasks of the AdvGLUE benchmark. The experimental results demonstrate that Self-Guard significantly enhances the adversarial robustness of LLMs, improving GPT-3.5's average accuracy by 6.3%.
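The abstract describes a training-free, prompt-driven loop: inspect and refine the (possibly adversarial) input, then rectify the model's answer. Below is a minimal Python sketch of how such a pipeline could be wired around a frozen LLM. The function names (`inspect_and_refine`, `rectify_answer`, `self_guard`), the `call`-style client, and all prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an inspect -> refine -> answer -> rectify loop, assuming the
# framework is driven purely by prompting a frozen LLM (no gradient updates).
# `LLM` is a hypothetical stand-in for any chat-completion client; the prompt
# wording below is illustrative only.
from typing import Callable

LLM = Callable[[str], str]  # maps a prompt string to the model's text response


def inspect_and_refine(llm: LLM, text: str) -> str:
    """Ask the model to flag likely adversarial perturbations (typos,
    distracting insertions, synonym swaps) and return a cleaned input."""
    report = llm(f"List any suspicious or adversarial edits in this text:\n{text}")
    return llm(
        "Rewrite the text so it preserves the original meaning but removes the "
        f"issues listed below.\nIssues:\n{report}\nText:\n{text}"
    )


def rectify_answer(llm: LLM, task_prompt: str, refined_text: str, draft: str) -> str:
    """Ask the model to check its draft answer against the refined input and
    align it with the task's label space."""
    return llm(
        f"Task: {task_prompt}\nInput: {refined_text}\nDraft answer: {draft}\n"
        "If the draft is inconsistent with the input or not a valid label, "
        "output a corrected answer; otherwise repeat the draft."
    )


def self_guard(llm: LLM, task_prompt: str, raw_text: str) -> str:
    """End-to-end guard: refine the input, answer, then rectify the answer."""
    refined = inspect_and_refine(llm, raw_text)
    draft = llm(f"Task: {task_prompt}\nInput: {refined}\nAnswer:")
    return rectify_answer(llm, task_prompt, refined, draft)
```

In this reading, every step is a call to the same frozen model, which is what makes the approach training-free; only the prompts differ between the inspection, refinement, and rectification stages.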
Paper Type: long
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.