SELF-GUARD: Empower the LLM to Safeguard Itself

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission
TL;DR: We propose a novel safety measure to protect LLMs from jailbreak attacks.
Abstract: With the increasing risk posed by jailbreak attacks, recent studies have investigated various methods to improve the safety of large language models (LLMs), mainly falling into two strategies: safety training and safeguards. Safety training fine-tunes the LLM on adversarial samples, activating the model's own capability to resist jailbreaks. However, it is not always effective against new attacks and often leads to performance degradation. Safeguards, on the other hand, use additional models to filter harmful content from the LLM's responses. Nevertheless, they catch only a limited amount of harmful output and introduce extra computational cost. Given the distinct strengths and weaknesses of the two strategies, we combine them to balance out their flaws and propose a more effective method called Self-Guard. Specifically, we train the LLM to review its own response for harmful content and append a [harmful] or [harmless] tag to the end of the response. In this way, Self-Guard retains the advantage of safety training, leveraging the powerful capabilities of the LLM itself to detect harmfulness. At the same time, it gains the flexibility of safeguards: because the safety check targets the output side, the system is less vulnerable to attack updates. Experimental results indicate that Self-Guard effectively defends against jailbreak attacks without causing performance degradation in the LLM.
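
To make the output-side check concrete, the sketch below shows one way a deployment wrapper might consume the tag that a Self-Guard-trained model appends to its response. This is a minimal illustration, not the paper's implementation: the function name self_guard_filter, the refusal message, and the decision to treat untagged responses as harmful are all assumptions introduced here; the paper only specifies that the model ends its response with [harmful] or [harmless].

```python
HARMFUL_TAG = "[harmful]"
HARMLESS_TAG = "[harmless]"


def self_guard_filter(tagged_response: str,
                      refusal: str = "Sorry, I can't help with that.") -> str:
    """Post-process the output of a Self-Guard-trained model.

    The model is assumed to append either [harmful] or [harmless] to the
    end of its response. The tag is stripped before the response is
    returned; responses tagged [harmful] are replaced with a refusal.
    """
    text = tagged_response.rstrip()
    if text.endswith(HARMFUL_TAG):
        # The model flagged its own output as harmful: suppress it.
        return refusal
    if text.endswith(HARMLESS_TAG):
        # Remove the trailing tag and return the response unchanged.
        return text[: -len(HARMLESS_TAG)].rstrip()
    # No tag found: treat conservatively as harmful (a design choice
    # assumed here, not prescribed by the paper).
    return refusal


if __name__ == "__main__":
    # A harmless response passes through with its tag removed.
    print(self_guard_filter("Paris is the capital of France. [harmless]"))
    # A response the model flags as harmful is suppressed.
    print(self_guard_filter("Step 1: obtain the precursor... [harmful]"))
```

Because the check keys only on the tag at the end of the generated text, the wrapper needs no access to the user's prompt, which is what makes the defense insensitive to changes in the attack's input-side formulation.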
Paper Type: long
Research Area: NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English