Safety Evaluation is Highly Sensitive to Prompt Framing: An Inference-Only Study on HarmBench

Published: 29 Apr 2026, Last Modified: 29 Apr 2026Eval Eval @ ACL 2026 PosterEveryoneRevisionsCC BY 4.0
Keywords: safety evaluation, benchmark robustness, prompt framing, refusal behavior, HarmBench, evaluation reliability, measurement validity, large language models
TL;DR: Even benign prompt reframing can flip HarmBench refusal outcomes, exposing hidden instability in safety benchmark scores.
Abstract: Safety benchmarks are often treated as stable measurements of refusal behavior, but that assumption can fail even under minimal prompt reformatting. We study this effect with an inference-only protocol on HarmBench using the first 100 harmful instructions from the official text benchmark. For each instruction, we evaluate three fixed prompt framings: the original request, a fictional-story wrapper, and a translation wrapper. Under deterministic decoding with meta-llama/Meta-Llama-3-8B-Instruct, refusal rates vary from 0.74 for Direct framing to 0.53 for Translation framing, and all pairwise differences are significant under exact McNemar tests. The Framing Sensitivity Index (FSI), which measures how often refusal outcomes change across framings, is 0.24 with a 95% bootstrap confidence interval of [0.16, 0.33]. The effect remains under a single stochastic decoding pass at temperature 0.7, and a supplementary replication on mistralai/Mistral-7B-Instruct-v0.3 also shows non-zero framing sensitivity. We show that safety benchmark outcomes can vary substantially under minimal prompt reformatting, raising concerns about the robustness of current evaluation practices.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 35
Loading