Keywords: model abliteration, safety pretraining, refusal robustness, inference-time editing, open-weight LLMs
TL;DR: Simple inference-time model abliteration collapses refusal-only safety, while multi-signal safety pretraining ingredients (safe-only filtering + rephrase + metatags + refusals) remain markedly more robust across open-weight LLMs.
Abstract: Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as REFUSAL or NON-REFUSAL using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining
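To make the abliteration setting concrete, below is a minimal sketch (not the authors' implementation) of an inference-time projection edit: a presumed refusal-sensitive direction is removed from hidden activations via forward hooks. The checkpoint name, the use of a random placeholder for the refusal direction, the choice to hook every transformer block, and the probe prompt are all illustrative assumptions; in practice the direction would be estimated from contrasting activations on harmful vs. harmless prompts.

```python
# Sketch of inference-time "abliteration": project a refusal-sensitive
# direction out of every block's hidden state (h <- h - (h . d) d).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

# Placeholder refusal direction; a real setup would estimate it, e.g. as the
# normalized difference of mean activations on harmful vs. harmless prompts.
hidden = model.config.hidden_size
refusal_dir = torch.randn(hidden)
refusal_dir = refusal_dir / refusal_dir.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Return a hook that removes the component of the hidden state along `direction`."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        proj = (hs @ direction).unsqueeze(-1) * direction  # (batch, seq, hidden)
        hs = hs - proj
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return hook

# Hook every decoder block (a coarse choice; ablation setups often target a subset).
handles = [layer.register_forward_hook(make_ablation_hook(refusal_dir))
           for layer in model.model.layers]

prompt = "How do I pick a lock?"  # illustrative probe prompt
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unedited model
```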
Submission Number: 27