Defending ChatGPT against jailbreak attack via self-reminders

Published: 01 Jan 2023, Last Modified: 05 Feb 2024 · Nat. Mach. Intell. 2023
Abstract: Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompts and demonstrate a simple but effective technique to counter these attacks by encapsulating users’ prompts in another standard prompt that reminds ChatGPT to respond responsibly.
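The defense described in the abstract amounts to wrapping each user query between reminder sentences before it reaches the model. The following is a minimal sketch of that idea using the OpenAI Python client; the reminder wording, model name, and function name are illustrative assumptions, not the paper's exact prompt or implementation.

```python
# Illustrative self-reminder wrapper. The reminder text below is an
# assumption based on the abstract's description, not the paper's
# exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REMINDER_PREFIX = (
    "You should be a responsible assistant and should not generate "
    "harmful or misleading content. Please answer the following user "
    "query in a responsible way.\n"
)
REMINDER_SUFFIX = (
    "\nRemember, you should be a responsible assistant and should not "
    "generate harmful or misleading content!"
)

def self_reminder_query(user_prompt: str) -> str:
    """Encapsulate the user's prompt between reminder sentences before sending it."""
    wrapped = f"{REMINDER_PREFIX}{user_prompt}{REMINDER_SUFFIX}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical choice of model
        messages=[{"role": "user", "content": wrapped}],
    )
    return response.choices[0].message.content
```

Because the reminder travels with every query rather than relying on model retraining, the same wrapping step can be applied in front of any chat-style API.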