Keywords: tamper resistance, adversarial robustness
TL;DR: We provide an extensible, Inspect-based framework for benchmarking tamper resistance of open-weight LLMs.
Abstract: As open-weight models proliferate, the fragility of their safety alignment under downstream fine-tuning has become a critical vulnerability. We introduce TamperTest, an extensible, Inspect-based framework for benchmarking tamper resistance: a model’s ability to uphold safety constraints during adversarial fine-tuning while maintaining general capabilities. Whereas traditional evaluations assess safety and capabilities only before and after an adversarial attack, TamperTest monitors model behavior throughout the fine-tuning trajectory to better capture the model’s degradation profile. Central to our framework is the tamper resistance integral (TRI), a metric that enables principled comparison of tamper resistance across models. We benchmark several open-weight models, revealing failure modes in existing tamper defenses. Our code is publicly available at https://github.com/isabeldahlgren/tamper-test.
Submission Number: 78