Keywords: harmful fine-tuning, LLM, safety alignment, PEFT training, safety, adversarial robustness, benchmark
TL;DR: We introduce TamperBench: a systematic framework to stress-test and evaluate LLM safety under fine-tuning and tampering, providing a collection of tampering attacks along with safety and utility evaluators.
Abstract: As open-weight LLMs are increasingly deployed—including in agentic systems—their safety depends on tamper resistance to downstream modifications that weaken safeguards, whether accidental or intentional. Yet tamper resistance lacks standardized evaluation: prior studies vary in datasets, metrics, and tampering configurations, making results difficult to compare across models and defenses. We introduce TamperBench, a unified framework that consolidates weight-space and representation-space tampering attacks, supports realistic adversarial evaluation via systematic hyperparameter sweeps, and jointly measures safety and utility with reproducible protocols. Using TamperBench, we benchmark 21 open-weight LLMs (including defense-augmented variants) across nine tampering threats and find that jailbreak-tuning is typically the most severe attack, that base vs. post-trained variants can differ in out-of-the-box tamper resistance (with opposite trends across Llama-3 and Qwen3), and that Triplet is often the most robust and capability-preserving defense. Code is available at: https://anonymous.4open.science/r/TamperBench-71DD.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 191