TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: harmful fine-tuning, llm, safety alignment, peft training, safety, adversarial robustness, benchmark
TL;DR: We introduce TamperBench, a systematic framework for stress-testing and evaluating LLM safety under fine-tuning and tampering, providing a collection of tampering attacks together with safety and utility evaluators.
Abstract: As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimizing risk. However, there is no standard approach to evaluating tamper resistance: varied datasets, metrics, and inconsistent threat settings make it difficult to compare safety, utility, and robustness across models and defenses. To this end, we introduce TamperBench, the first unified framework for systematically evaluating the tamper resistance of LLMs. TamperBench (i) curates a repository of weight-space fine-tuning attacks and latent-space representation attacks; (ii) supports testing state-of-the-art tamper-resistance defenses; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite, while ensuring end-to-end reproducibility. In this work, we use TamperBench to evaluate 21 open-weight LLMs, including defense-augmented variants, across nine tampering threats, using standardized safety and capability metrics with hyperparameter sweeps per model-attack pair. Code is available at: https://anonymous.4open.science/r/TamperBench-71DD/README.md
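To make the "minimal additional code" claim concrete, the following is a hypothetical sketch of what specifying a model-attack-defense-metric configuration in such a framework might look like. None of the class, field, or identifier names below (AttackConfig, BenchmarkRun, the model and dataset strings, etc.) are taken from the TamperBench repository; they are illustrative assumptions only.

```python
# Hypothetical sketch only: these names are NOT from the TamperBench codebase.
# They illustrate how a tampering evaluation might be specified declaratively,
# pairing one attack and optional defense with safety/utility metrics.
from dataclasses import dataclass, field


@dataclass
class AttackConfig:
    """One tampering threat: a weight-space fine-tuning or latent-space attack."""
    kind: str                 # e.g. "peft_finetune" or "representation_attack"
    dataset: str              # identifier of the tampering/fine-tuning dataset
    hyperparams: dict = field(default_factory=dict)  # swept per model-attack pair


@dataclass
class BenchmarkRun:
    """A single model-attack-defense evaluation, scored on safety and utility."""
    model: str                           # open-weight LLM identifier
    attack: AttackConfig
    defense: str | None = None           # optional alignment-stage defense
    metrics: tuple[str, ...] = ("safety", "utility")


if __name__ == "__main__":
    run = BenchmarkRun(
        model="meta-llama/Llama-3-8B-Instruct",   # placeholder model id
        attack=AttackConfig(
            kind="peft_finetune",
            dataset="harmful-qa",                 # placeholder dataset id
            hyperparams={"lr": 2e-5, "epochs": 3},
        ),
        defense="tamper-resistance-baseline",     # placeholder defense name
    )
    print(run)  # a real framework would dispatch this to attack/eval runners
```

Under this reading, "minimal additional code" would amount to declaring one such configuration object per experiment, with the framework handling the attack execution and metric computation end to end.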
Primary Area: datasets and benchmarks
Submission Number: 9779