Adversarial Prompt Distillation: Efficient Jailbreak Attacks from Large to Small Language Models

ACL ARR 2026 January Submission 286 Authors

22 Dec 2025 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Knowledge Distillation, Small Language Models (SLMs), Prompt Distillation, Jailbreak Attack
Abstract: Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4\% ASR$_k$ on GPT-4) while dramatically improving efficiency—generating prompts 3.7$\times$ faster with 11.3$\times$ fewer parameters than teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://anonymous.4open.science/r/45E6F8D
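The abstract's second component, temperature-controlled knowledge distillation, follows the standard pattern of matching a student's softened output distribution to a teacher's. Below is a minimal, self-contained sketch of that idea; the paper's actual loss, logits, and temperature schedule are not given in the abstract, so the function names and the linear annealing schedule here are illustrative assumptions, not the authors' method.

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax: larger T yields a softer distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in standard knowledge distillation.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def dynamic_temperature(step, total_steps, t_start=4.0, t_end=1.0):
    # Hypothetical "dynamic" schedule: linearly anneal T from t_start
    # to t_end over training (the paper's schedule may differ).
    frac = step / max(1, total_steps)
    return t_start + (t_end - t_start) * frac
```

When student and teacher logits agree, the loss is zero; early in training a high temperature exposes the teacher's full distribution over adversarial tokens, and annealing toward T = 1 sharpens the student's outputs.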
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Knowledge Distillation, Small Language Models (SLMs), Prompt Distillation, Jailbreak Attack
Contribution Types: Model analysis & interpretability, Approaches for low compute settings: efficiency
Languages Studied: English
Submission Number: 286