TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

ICLR 2026 Conference Submission9779 Authors

17 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: harmful fine-tuning, LLM, safety alignment, PEFT training, safety, adversarial robustness, benchmark
TL;DR: We introduce TamperBench, a systematic framework for stress-testing and evaluating LLM safety under fine-tuning and tampering, providing a collection of tampering attacks along with safety and utility evaluators.
Abstract: As increasingly capable open-weight large language models (LLMs) are deployed, improving their resistance to unsafe modifications, whether accidental or intentional, becomes critical to minimizing AI-associated risks. However, the growing number of safety alignment approaches aimed at tamper resistance has produced fragmented evaluations: varied datasets, metrics, and inconsistent threat settings make it difficult to fairly compare safety, utility, and robustness across defenses. To this end, we introduce TamperBench, a unified framework for evaluating the tamper resistance of LLMs. TamperBench (i) curates a repository of weight-space fine-tuning attacks and latent-space representation attacks; (ii) supports testing state-of-the-art tamper-resistance defenses; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite while ensuring end-to-end reproducibility. We showcase its value by evaluating numerous safety-aligned open-weight LLMs across tampering attacks with varied configurations. To the best of our knowledge, TamperBench is the first large-scale evaluation framework for assessing tamper resistance in LLMs. In this work, we evaluate 19 open-weight LLMs, including defense-augmented variants, across nine tampering regimes using standardized safety and capability metrics with hyperparameter sweeps per model–attack pair. Code is available at: https://anonymous.4open.science/r/TamperBench-71DD/README.md
Primary Area: datasets and benchmarks
Submission Number: 9779