Keywords: Knowledge Distillation, Small Language Models (SLMs), Prompt Distillation, Jailbreak Attack
Abstract: Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4\% ASR$_k$ on GPT-4) while dramatically improving efficiency: it generates prompts 3.7$\times$ faster with 11.3$\times$ fewer parameters than its teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://anonymous.4open.science/r/45E6F8D
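The abstract names dynamic temperature-controlled knowledge distillation as the second component. The paper's exact objective and schedule are not given here, so the following is a minimal sketch of the generic temperature-scaled distillation loss (Hinton-style KL between teacher and student softmax distributions, scaled by T²) with a hypothetical linear temperature schedule; the function names `kd_loss` and `dynamic_temperature` and the endpoints `t_start`/`t_end` are illustrative assumptions, not APD's actual implementation.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature):
    """Standard distillation term: KL(teacher || student) at
    temperature T, multiplied by T^2 to keep gradient magnitudes
    comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def dynamic_temperature(step, total_steps, t_start=4.0, t_end=1.0):
    """Hypothetical linear schedule: soften teacher targets early in
    training, then sharpen them toward standard cross-entropy."""
    frac = step / max(total_steps, 1)
    return t_start + (t_end - t_start) * frac
```

In practice the distillation term is typically combined with the student's ordinary task loss; the schedule above is only one plausible way to make the temperature "dynamic" over training.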
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Knowledge Distillation, Small Language Models (SLMs), Prompt Distillation, Jailbreak Attack
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 286