Keywords: jailbreak prompt, intervention
Abstract: Jailbreak prompts, inputs crafted to elicit unsafe content from LLMs, pose a critical threat to the safe deployment of LLMs. Traditional jailbreak methods optimize malicious prompts to induce an affirmative initial response, assuming that harmful generation will then continue. However, the effectiveness of these initial responses varies, which affects the likelihood of subsequent harmful output. This work focuses on the importance of selecting a proper initial response and the difficulties involved in doing so. We propose a new method that uses model steering to select initial responses likely to lead to successful attacks. Our experiments show that this method substantially improves the accuracy of initial-response selection, leading to high attack success rates.
Submission Number: 46