Keywords: harmful fine-tuning, LLM, safety alignment, PEFT training, safety, adversarial robustness, benchmark
TL;DR: We introduce TamperBench: a systematic framework to stress-test and evaluate LLM safety under fine-tuning and tampering, providing a collection of tampering attacks along with safety and utility evaluators.
Abstract: As open-weight LLMs are increasingly deployed—including in agentic systems—their safety depends on tamper resistance to downstream modifications that weaken safeguards, whether accidental or intentional. Yet tamper resistance lacks standardized evaluation: prior studies vary in datasets, metrics, and tampering configurations, making results difficult to compare across models and defenses. We introduce TamperBench, a unified framework that consolidates weight-space and representation-space tampering attacks, supports realistic adversarial evaluation via systematic hyperparameter sweeps, and jointly measures safety and utility with reproducible protocols. Using TamperBench, we benchmark 21 open-weight LLMs (including defense-augmented variants) across nine tampering threats and find that jailbreak-tuning is typically the most severe attack, that base vs. post-trained variants can differ in out-of-the-box tamper resistance (with opposite trends across Llama-3 and Qwen3), and that Triplet is often the most robust and capability-preserving defense. Code is available at: https://anonymous.4open.science/r/TamperBench-71DD.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 191