An Embarrassingly Simple Defense Against LLM Abliteration Attacks

ACL ARR 2026 January Submission8590 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Large Language Models, Safety, Alignment, Refusal, Abliteration

Abstract: Large language models (LLMs) are typically aligned to refuse harmful instructions through safety fine-tuning. A recent attack, termed abliteration, identifies and suppresses the single latent direction most responsible for refusal behavior, thereby enabling models to generate harmful content. We propose a defense that fundamentally alters how models express refusal. We construct an extended-refusal dataset in which responses to harmful prompts provide detailed justifications before refusing, distributing the refusal signal across multiple token positions. Fine-tuning multiple models of different architectures and sizes on this dataset yields models that maintain high refusal rates under abliteration: refusal rates drop by at most 10%, compared to 70–80% drops in baseline models. Comprehensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning effectively neutralizes abliteration attacks while preserving general model performance and enhancing robustness across multiple alignment scenarios. Our dataset and models will be made publicly available.
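
To make the attack described in the abstract concrete, the following is a minimal sketch of suppressing a single latent direction, run on synthetic activations rather than a real model. It assumes the direction is estimated as a difference of mean activations between harmful and harmless prompts, which is a common formulation but is not stated in the abstract. The function names, the dimensionality, and the synthetic data are all hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a unit 'refusal' direction as a difference of mean activations.

    Assumption: difference-of-means is one common estimator; the abstract
    only says that a single latent direction is identified.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts, direction):
    """Remove the component of each activation vector along `direction`."""
    # (acts @ direction) is the scalar projection of every row onto the
    # unit direction; the outer product turns it back into vectors.
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for hidden states: harmful prompts are shifted along
# one axis, which plays the role of the refusal direction.
rng = np.random.default_rng(0)
d = 16                      # hypothetical hidden size
true_dir = np.zeros(d)
true_dir[0] = 1.0

harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

# After ablation, the activations carry no component along r_hat.
print("max |projection| after ablation:", np.abs(ablated @ r_hat).max())
```

In this toy setting the refusal signal lives in one direction, so a single projection removes it. The defense proposed in the abstract aims to distribute the refusal signal across multiple token positions so that one such projection no longer suffices.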

Paper Type: Long

Research Area: Ethics, Bias, and Fairness

Research Area Keywords: Large Language Models, Safety, Alignment, Refusal, Abliteration

Contribution Types: Model analysis & interpretability, Data resources

Languages Studied: English

Submission Number: 8590
