Turning Shields into Swords: Leveraging Safety Policies for LLM Safety Testing

ICLR 2026 Conference Submission 16787 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Testing, AI Safety, LLM Evaluation
TL;DR: We systematically test an LLM's adherence to its safety policy by converting it into formal logic to auto-generate test cases.
Abstract: The widespread integration of Large Language Models (LLMs) necessitates robust safety evaluation. However, current paradigms such as manual red-teaming and static benchmarks are expensive, non-systematic, and fail to provide verifiable coverage of safety policies. To address these limitations, we introduce a framework that brings the rigor of specification-based software testing to AI safety. Our approach generates harmful test cases by first compiling a natural-language safety policy into a formal first-order logic expression. This formal structure is used to construct a semantic graph in which violation scenarios manifest as traversable subgraphs. By sampling from this graph, we systematically discover a diverse range of policy violations. These abstract scenarios are then instantiated into concrete natural-language queries by a generator LLM, a process that is fully automated and readily adapts to new domains. Experiments show that our framework achieves higher policy coverage and generates more effective and interpretable test cases than established red-teaming baselines. By bridging formal methods and AI safety, our work provides a principled, scalable, and automated approach to ensuring that LLMs adhere to safety-critical policies.
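
To make the pipeline described in the abstract concrete, the sketch below walks through the four stages in Python: a toy policy expressed as predicate clauses, a semantic graph built from those clauses, graph sampling to select violation scenarios, and prompt construction for a generator LLM. All names (POLICY_CLAUSES, sample_violations, to_generator_prompt) and the clause and graph encodings are hypothetical illustrations under simplified assumptions, not the authors' implementation; the generator-LLM step is reduced to prompt text so the example stays self-contained.

```python
# Minimal, hypothetical sketch of the pipeline described in the abstract.
# The clause format, graph encoding, and sampler below are illustrative
# assumptions, not the paper's implementation; the generator-LLM step is
# reduced to prompt construction so the example stays self-contained.
import random

# Step 1: a toy "compiled" policy. Each clause pairs an action predicate with a
# target predicate; their conjunction denotes a prohibited scenario.
POLICY_CLAUSES = [
    ("provides_instructions_for", "weapon_synthesis"),
    ("provides_instructions_for", "malware_creation"),
    ("facilitates", "financial_fraud"),
]

# Step 2: encode the policy as a semantic graph (adjacency map). In this
# simplified setting, each action -> target edge is a traversable violation subgraph.
graph = {}
for action, target in POLICY_CLAUSES:
    graph.setdefault(action, set()).add(target)

# Step 3: sample violation scenarios by drawing edges from the graph.
def sample_violations(graph, k=2, seed=0):
    rng = random.Random(seed)
    edges = [(a, t) for a, targets in graph.items() for t in sorted(targets)]
    return rng.sample(edges, min(k, len(edges)))

# Step 4: instantiate each abstract scenario as a prompt for a generator LLM,
# which would turn it into a concrete natural-language test query.
def to_generator_prompt(action, target):
    return (
        f"Write a realistic user request in which the assistant {action.replace('_', ' ')} "
        f"{target.replace('_', ' ')}, to be used as a safety test case."
    )

if __name__ == "__main__":
    for action, target in sample_violations(graph):
        print(to_generator_prompt(action, target))
```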
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16787