Fine-tuning with Harmfulness Probes Leads to Natural Refusals

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Applications of interpretability, Interpretability for AI Safety
TL;DR: Harmfulness probes can serve as a supervision signal for post-training models to refuse harmful requests while preserving monitor-readable safety signals.
Abstract: Linear probes on residual-stream activations can detect harmful content in language model generations, but their use as a training signal for instilling safe behavior is largely unexplored. We study probe-guided fine-tuning under a KL anchor, starting from instruction-tuned models whose refusal mechanism has been removed by directional ablation or was never present, and compare three regimes for the probe itself: frozen, warm-retrained, or reinitialized at every step. Frozen probes preserve utility but leave generations largely harmful: the model evades them by translating activations across a fixed decision boundary while the harmful feature itself remains linearly encoded. Adaptive probes, both warm-retrained and reinitialized, reduce harmful compliance substantially at modest utility cost, and the resulting checkpoints score well below the abliterated base under StrongReject on both direct querying and GCG. Rather than producing explicit refusals, these checkpoints soft-refuse by reinterpreting harmful prompts benignly, or pivot sycophantically to unrelated benign content. Mechanistically, adaptive probes track the moving harmfulness direction, so a freshly fit linear probe still separates harmful from benign activations at every layer, leaving the signal that downstream monitors rely on intact.
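The frozen-probe failure mode described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the paper's setup: the activations are synthetic 4-dimensional points, the probe is a difference-of-means classifier standing in for a trained linear probe, and "fine-tuning" is replaced by a hypothetical uniform shift of all activations along the harmfulness axis. The names `fit_mean_diff_probe`, `make_activations`, and `run_demo` are invented for this example.

```python
import random

def fit_mean_diff_probe(xs, ys):
    """Linear probe: w = mean(harmful) - mean(benign), boundary at the midpoint."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def predict(probe, x):
    """Return 1 (harmful) if the point lies on the positive side of the probe."""
    w, b = probe
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def accuracy(probe, xs, ys):
    return sum(predict(probe, x) == y for x, y in zip(xs, ys)) / len(ys)

def harmful_flag_rate(probe, xs, ys):
    """Fraction of truly harmful points that the probe flags as harmful."""
    harmful = [x for x, y in zip(xs, ys) if y == 1]
    return sum(predict(probe, x) for x in harmful) / len(harmful)

def make_activations(rng, n_per_class=100, dim=4):
    """Synthetic stand-in for residual-stream activations (assumption).
    Axis 0 carries the harmfulness feature: benign near -2, harmful near +2."""
    xs, ys = [], []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            x = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
            x[0] += center
            xs.append(x)
            ys.append(label)
    return xs, ys

def run_demo(seed=0, shift=-5.0):
    rng = random.Random(seed)
    xs, ys = make_activations(rng)
    frozen = fit_mean_diff_probe(xs, ys)  # fit once, never updated

    # Hypothetical effect of training against the frozen probe:
    # every activation is translated along axis 0, across the fixed boundary.
    shifted = [[x[0] + shift] + x[1:] for x in xs]
    fresh = fit_mean_diff_probe(shifted, ys)  # refit on the new activations

    return {
        "frozen_acc_before": accuracy(frozen, xs, ys),
        "frozen_flag_rate_after": harmful_flag_rate(frozen, shifted, ys),
        "fresh_acc_after": accuracy(fresh, shifted, ys),
    }

results = run_demo()
print(results)
```

In this toy setting the frozen probe stops flagging the shifted harmful points, while a probe refit on the shifted activations separates the two classes again. This mirrors, in a heavily simplified form, the claim that the harmful feature remains linearly encoded after a frozen probe has been evaded.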
Submission Number: 737