Stealing Black-box LLM Security Logic: Predicting Jailbreak Attack Success Rate via Comparative Safety Proxies

ACL ARR 2026 January Submission 1096 Authors

28 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM, Jailbreak, Proxy model
Abstract: Large language models (LLMs) remain vulnerable to instruction-level jailbreak attacks. Predicting when and why a black-box LLM produces unsafe outputs would enable more efficient vulnerability discovery and red-teaming optimization, yet directly regressing Attack Success Rate (ASR) with lightweight proxy models is hindered by high sampling costs and poor generalization to unseen harmful topics. In this work, we propose CompSP, a lightweight Safety Proxy that predicts the relative ordering of ASR between two prompts derived from unseen harmful topics for a black-box LLM. To support effective proxy training and evaluation, we introduce the Outline Filling Attack (OFA), a highly diverse jailbreak method that elicits unsafe responses within 90 tokens across 99\% of harmful topics. Using OFA-generated prompts, we show that CompSP can be trained effectively and reliably guides jailbreak optimization. Experiments demonstrate that CompSP achieves 0.73 accuracy in ASR ordering prediction and 0.70 accuracy on prompt pairs with non-extreme ASR values. Guided by CompSP, attack success rates improve by over 40\% on commercial black-box models, including GPT-4o-mini and Qwen-Plus. These results indicate that the safety decision boundaries of black-box LLMs are distillable, highlighting a critical security risk in current alignment mechanisms.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: adversarial attacks/examples/training, robustness, probing, safety and alignment, red teaming, security and privacy
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1096