Abstract: With the growing impact of large language models (LLMs) across various applications, ensuring their robustness has become an increasingly urgent concern. Traditional adversarial defense methods typically rely on costly model retraining to enhance adversarial robustness (AR), which is prohibitive for LLMs. To address this challenge, in this paper we introduce the Self-Guard framework, which protects the robustness of the LLM inference process. Our framework leverages learning from AI feedback, thereby eliminating the need for training and optimization. It interactively inspects and refines potential risks in the input text, and then rectifies the LLM's outputs for answer alignment. We evaluate our framework with four representative LLMs, GPT-3.5, Falcon, Llama2, and StableBeluga2, on all five tasks of the AdvGLUE benchmark. The experimental results demonstrate that our framework significantly enhances the adversarial robustness of LLMs, improving the average accuracy of GPT-3.5 by 6.3%.
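To make the inspect-refine-rectify pipeline described above concrete, the following is a minimal sketch of a training-free, prompt-based loop. The `llm` callable, the prompt wording, and the single-pass control flow are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch of a prompt-based inspect -> refine -> rectify loop,
# assuming each Self-Guard stage is realized as a plain LLM call.
# The prompts and the `llm` interface below are hypothetical.
from typing import Callable

LLM = Callable[[str], str]  # takes a prompt string, returns the model's text response


def self_guard_answer(llm: LLM, task_instruction: str, input_text: str) -> str:
    # 1) Inspect: ask the model whether the input looks adversarially perturbed.
    inspection = llm(
        "Check the following input for typos, distracting phrases, or other "
        f"adversarial perturbations, and list any issues found.\nInput: {input_text}"
    )

    # 2) Refine: rewrite the input to remove the flagged risks while keeping its meaning.
    refined_input = llm(
        "Rewrite the input so the issues below are removed but the meaning is preserved.\n"
        f"Issues: {inspection}\nInput: {input_text}\nRewritten input:"
    )

    # 3) Answer the task on the refined input.
    raw_answer = llm(f"{task_instruction}\nInput: {refined_input}\nAnswer:")

    # 4) Rectify: align the free-form answer with the task's expected label space
    #    (e.g., map verbose output onto a fixed label set for classification tasks).
    rectified = llm(
        f"{task_instruction}\nGiven the answer below, output only the final label.\n"
        f"Answer: {raw_answer}\nLabel:"
    )
    return rectified.strip()
```

In this reading, all four steps are ordinary inference calls to the same model, which is what allows the defense to avoid any retraining or parameter updates.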