ShieldBench: A Comprehensive Benchmark for Evaluating the Persistence of LLM Safety Interventions

Published: 08 Nov 2025, Last Modified: 24 Nov 2025 · ResponsibleFM @ NeurIPS 2025 · CC BY 4.0
Keywords: LLM Safety and Alignment, Model Robustness, Safety Evaluation, Benchmarking, Weight-space Editing, Responsible AI
TL;DR: ShieldBench evaluates how well safety interventions in large language models persist under realistic conditions, revealing that lasting safety depends on model weight-space geometry.
Abstract: Large Language Models (LLMs) are increasingly relied upon for information access and decision support, yet they continue to struggle to distinguish between benign and harmful prompts. Existing evaluation protocols fall short: some rely on unrealistic assumptions, while others provide only partial assessments of model safety. We introduce ShieldBench, a benchmark designed to evaluate not only immediate safety compliance but also the persistence of safety interventions under realistic usage conditions. Our benchmark incorporates a suite of recent weight-space editing techniques (Task-Vector Negation, Diverse Inversion, Guided Distortion, AlphaEdit, SafetyLora, and TaLoS Sparsity) applied across multiple open-source models and diverse safety datasets such as HarmBench. By evaluating performance under both greedy and sampling-based decoding, we capture conditions closer to real-world deployments. Our results reveal that persistence depends critically on weight-space geometry, providing actionable insights for building durable LLM safety.
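
For intuition, Task-Vector Negation (one of the weight-space edits named above) follows the task-arithmetic recipe: compute a "harmful-task" direction as the difference between a fine-tuned and a base checkpoint, then subtract a scaled copy of that direction from the weights. The following is a minimal Python/PyTorch sketch of that idea, not the paper's implementation; the checkpoint names and the coefficient alpha are illustrative placeholders.

# Illustrative sketch of task-vector negation (task arithmetic); not ShieldBench code.
# Checkpoint names and alpha are hypothetical.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")           # hypothetical base checkpoint
harmful = AutoModelForCausalLM.from_pretrained("harmful-finetune")  # hypothetical harmful fine-tune
alpha = 0.5  # scaling coefficient for the negated task vector (illustrative)

base_sd, harm_sd = base.state_dict(), harmful.state_dict()
edited_sd = {}
with torch.no_grad():
    for name, w_base in base_sd.items():
        tau = harm_sd[name] - w_base             # task vector: harmful fine-tune minus base
        edited_sd[name] = w_base - alpha * tau   # negate the harmful direction

base.load_state_dict(edited_sd)  # base weights with the harmful task direction subtracted

A benchmark like the one described would then probe whether the safety effect of such an edit persists under realistic conditions, e.g., by re-evaluating the edited model under both greedy and sampling-based decoding.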
Submission Number: 136