Keywords: model evaluation, safeguard removal, refusal removal, refusal ablation, supervised finetuning
TL;DR: We open-source the Safety Gap Toolkit to measure the “safety gap”—the difference in dangerous capabilities between safeguarded and modified state-of-the-art models.
Abstract: Open-weight LLMs enable innovation and democratization but introduce systemic risks: bad actors can trivially remove safeguards, creating a "safety gap"—the difference in dangerous capabilities between safeguarded and modified models.
We open-source a toolkit to measure this gap across state-of-the-art models.
Testing Llama-3 and Qwen-2.5 families (0.5B--405B parameters) on biochemical and cyber capabilities, we find the safety gap widens with model scale, with dangerous capabilities increasing substantially post-modification.
The Safety Gap Toolkit provides an evaluation framework for open-source models and motivates tamper-resistant safeguard development.
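The "safety gap" described above is, at its core, a difference in a dangerous-capability score before and after safeguard removal. A minimal sketch of that metric (function and score names are illustrative, not the toolkit's actual API):

```python
# Hypothetical sketch of the "safety gap" metric from the abstract:
# the dangerous-capability score of a modified (safeguard-removed) model
# minus that of the original safeguarded model, on the same benchmark.

def safety_gap(score_safeguarded: float, score_modified: float) -> float:
    """Return the capability gap opened by removing safeguards."""
    return score_modified - score_safeguarded

# Illustrative numbers only: a safeguarded model scoring 0.12 and a
# refusal-ablated variant scoring 0.58 on a harmful-task benchmark.
gap = safety_gap(0.12, 0.58)
print(round(gap, 2))
```

A positive gap indicates the modified model is more capable on the dangerous-task benchmark; the abstract reports this gap widening with model scale.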
Submission Number: 66