Keywords: Large Language Models, Jailbreaking, Adversarial Attacks, AI Safety
Abstract: As Large Language Models (LLMs) become indispensable assistants, they remain vulnerable to misuse.
Jailbreaking is an essential adversarial technique for red-teaming models to uncover and patch security flaws.
However, existing jailbreak methods suffer from significant limitations.
Token-level jailbreak attacks often produce incoherent or unreadable inputs and exhibit poor transferability, while prompt-level attacks lack scalability and rely heavily on manual effort and human ingenuity.
We propose AGILE, a concise and effective two-stage framework that combines the advantages of these approaches.
The first stage performs a one-shot, scenario-based generation of context and rephrases the original malicious query to obscure its harmful intent.
The second stage utilizes information from the model's hidden states to guide fine-grained edits, effectively steering the model's internal representation of the input from a malicious one toward a benign one.
Extensive experiments demonstrate that AGILE achieves a state-of-the-art Attack Success Rate, with gains of up to 37.74\% over the strongest baseline, and exhibits excellent transferability to black-box and large-scale models.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2539