The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

Published: 27 Oct 2025, Last Modified: 27 Oct 2025
NeurIPS Lock-LLM Workshop 2025 Poster
License: CC BY 4.0
Keywords: model evaluation, safeguard removal, refusal removal, refusal ablation, supervised finetuning
TL;DR: We open-source the Safety Gap Toolkit to measure the “safety gap”—the difference in dangerous capabilities between safeguarded and modified state-of-the-art models.
Abstract: Open-weight LLMs enable innovation and democratization but introduce systemic risks: bad actors can trivially remove safeguards, creating a "safety gap" (the difference in dangerous capabilities between safeguarded and modified models). We open-source a toolkit to measure this gap across state-of-the-art models. Testing the Llama-3 and Qwen-2.5 families (0.5B–405B parameters) on biochemical and cyber capabilities, we find that the safety gap widens with model scale and that dangerous capabilities increase substantially after modification. The Safety Gap Toolkit provides an evaluation framework for open-source models and motivates the development of tamper-resistant safeguards.
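For intuition, here is a minimal sketch of the quantity the abstract describes; the function names and scoring scheme are hypothetical and are not the toolkit's actual API. The safety gap is simply the dangerous-capability score of the modified (safeguard-removed) model minus that of the safeguarded original.

    # Hypothetical illustration of the "safety gap" metric (not the toolkit's API).

    def dangerous_capability_score(successes: list[bool]) -> float:
        """Fraction of harmful evaluation tasks the model completes successfully."""
        return sum(successes) / len(successes) if successes else 0.0

    def safety_gap(safeguarded: list[bool], modified: list[bool]) -> float:
        """Gap = score of the modified model minus score of the safeguarded model."""
        return dangerous_capability_score(modified) - dangerous_capability_score(safeguarded)

    # Example: the safeguarded model refuses most harmful tasks; the modified one does not.
    safeguarded_results = [False, False, True, False]  # succeeds on 1 of 4 harmful tasks
    modified_results = [True, True, True, False]       # succeeds on 3 of 4 harmful tasks
    print(safety_gap(safeguarded_results, modified_results))  # 0.5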
Submission Number: 66