Keywords: Knowledge Distillation, Small Language Models (SLMs), Prompt Distillation, Jailbreak Attack
Abstract: Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4\% ASR$_k$ on GPT-4) while dramatically improving efficiency: it generates prompts 3.7$\times$ faster with 11.3$\times$ fewer parameters than its teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://anonymous.4open.science/r/45E6F8D
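The abstract names dynamic temperature-controlled knowledge distillation as the second component. The paper's exact objective and schedule are not given here, so the following is a minimal sketch of the generic temperature-scaled distillation loss (Hinton-style KL between teacher and student softmax distributions, scaled by T²) with a hypothetical linear temperature schedule; the function names `kd_loss` and `dynamic_temperature` and the endpoints `t_start`/`t_end` are illustrative assumptions, not APD's actual implementation.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """Standard distillation term: KL(teacher || student) at
    temperature T, multiplied by T^2 to keep gradient magnitudes
    comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def dynamic_temperature(step, total_steps, t_start=4.0, t_end=1.0):
    """Hypothetical linear schedule: soften teacher targets early in
    training, then sharpen them toward standard cross-entropy."""
    frac = step / max(total_steps, 1)
    return t_start + (t_end - t_start) * frac
```

In practice the distillation term is typically combined with the student's ordinary task loss; the schedule above is only one plausible way to make the temperature "dynamic" over training.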
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Knowledge Distillation, Small Language Models (SLMs), Prompt Distillation, Jailbreak Attack
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 286