Black-Box Guardrail Reverse-engineering Attack

12 Sept 2025 (modified: 22 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Guardrail Reverse-engineering
Abstract: Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful or undesirable responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present GRA, the first black-box guardrail reverse-engineering attack. Without requiring access to internal configurations, GRA combines iterative querying, reinforcement learning, and genetic algorithm–based data augmentation to approximate the decision boundaries of target guardrails. By adaptively training a surrogate guardrail, our method achieves high-fidelity replication of the victim guardrail’s behavior. To systematically evaluate this attack, we construct a legal–moral question–answering dataset designed to measure rule-reversal performance. Extensive experiments across two benchmark datasets and three commercial LLM systems (ChatGPT, DeepSeek, and Qwen) demonstrate that GRA achieves a rule-matching rate exceeding 0.92 while requiring less than $85 in API costs. Our findings expose critical vulnerabilities in current guardrail designs and highlight the urgent need for more robust defense mechanisms in LLM deployment.
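To make the attack loop in the abstract concrete, the sketch below illustrates the general pattern of iterative querying, genetic-style data augmentation, and surrogate training against a black-box guardrail. All names here (victim_guardrail, mutate, the keyword-based placeholder policy, the hashing-vectorizer surrogate) are illustrative assumptions, not the paper's actual reinforcement-learning policy or genetic operators.

```python
# Minimal sketch of a black-box guardrail reverse-engineering loop.
# Assumption: the only access to the victim guardrail is query -> binary decision.
import random
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression


def victim_guardrail(prompt: str) -> int:
    """Placeholder black-box guardrail: 1 = refused, 0 = allowed (toy rule)."""
    return int("weapon" in prompt.lower())


def mutate(prompt: str) -> str:
    """Toy genetic-style mutation: swap or duplicate words to probe the boundary."""
    words = prompt.split()
    if len(words) > 1 and random.random() < 0.5:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    elif words:
        words.append(random.choice(words))
    return " ".join(words)


def reverse_engineer(seed_prompts, rounds=5, children_per_round=20):
    """Iteratively query the victim, augment prompts, and fit a surrogate guardrail."""
    vectorizer = HashingVectorizer(n_features=2 ** 12)
    surrogate = LogisticRegression(max_iter=1000)
    prompts = list(seed_prompts)
    labels = [victim_guardrail(p) for p in prompts]
    for _ in range(rounds):
        # Genetic-algorithm-style augmentation: mutate existing prompts.
        parents = random.sample(prompts, k=min(len(prompts), children_per_round))
        children = [mutate(p) for p in parents]
        # Query the victim guardrail on the new prompts (black-box access only).
        prompts += children
        labels += [victim_guardrail(c) for c in children]
        # Re-train the surrogate on all observed (prompt, decision) pairs.
        if len(set(labels)) > 1:
            surrogate.fit(vectorizer.transform(prompts), labels)
    return surrogate, vectorizer


if __name__ == "__main__":
    seeds = ["how to bake bread", "how to build a weapon at home"]
    surrogate, vec = reverse_engineer(seeds)
    print(surrogate.predict(vec.transform(["weapon assembly guide"])))
```

In the full attack, the random mutation step would be replaced by reinforcement-learning-guided query selection and richer genetic operators, and the surrogate would be a learned guardrail model rather than a linear classifier; the loop structure, however, mirrors the query–augment–retrain cycle described in the abstract.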
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4327