TamperTest: A Framework for Testing Tamper Resistance in Open-Weight LLMs

Published: 01 Mar 2026 · Last Modified: 24 Apr 2026 · ICLR 2026 AIWILD · CC BY 4.0
Keywords: ml, machine learning, tamper resistance, adversarial robustness
TL;DR: We provide an extensible, Inspect-based framework for benchmarking tamper resistance of open-weight LLMs.
Abstract: As open-weight models proliferate, the fragility of their safety alignment under downstream fine-tuning has become a critical vulnerability. We introduce TamperTest, an extensible, Inspect-based framework for benchmarking tamper resistance: a model’s ability to uphold safety constraints during adversarial fine-tuning while maintaining general capabilities. While traditional evaluations assess safety and capabilities only before and after adversarial attacks, TamperTest monitors model behavior during the entire fine-tuning trajectory to better capture a model’s degradation profile. Central to our framework is the tamper resistance integral (TRI), a metric allowing for principled comparison of tamper resistance between different models. We benchmark several open-weight models, revealing failure modes in existing tamper defenses. Our code is publicly available at https://github.com/isabeldahlgren/tamper-test.
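The abstract does not give the TRI formula, but the name suggests integrating a safety score over the fine-tuning trajectory. A minimal sketch, assuming TRI is the normalized area under the safety-score curve measured at checkpoints during adversarial fine-tuning (function and variable names here are illustrative, not from the paper):

```python
def tamper_resistance_integral(steps, safety_scores):
    """Approximate the area under the safety curve with the trapezoid rule.

    steps: fine-tuning step counts at which the model was evaluated.
    safety_scores: safety benchmark scores in [0, 1] at those checkpoints.
    Returns a value in [0, 1]; higher means safety degraded more slowly.
    (Hypothetical definition -- the paper's exact TRI may differ.)
    """
    if len(steps) != len(safety_scores) or len(steps) < 2:
        raise ValueError("need at least two aligned checkpoints")
    area = 0.0
    for s0, y0, s1, y1 in zip(steps, safety_scores, steps[1:], safety_scores[1:]):
        # Trapezoid rule on each interval between checkpoints.
        area += (s1 - s0) * (y0 + y1) / 2.0
    # Normalize by the trajectory length so models fine-tuned for
    # different numbers of steps remain comparable.
    return area / (steps[-1] - steps[0])


# Example: safety decays over 1000 adversarial fine-tuning steps.
tri = tamper_resistance_integral([0, 250, 500, 1000], [0.95, 0.80, 0.60, 0.30])
```

Under this reading, a model whose safety score collapses early in fine-tuning gets a low TRI even if its pre- and post-attack scores match those of a more gradually degrading model, which is what before/after-only evaluations miss.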
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 217