SPARD: Defending Against Harmful Fine-Tuning Attacks via Safety Projection with Relevance–Diversity Data Selection
Keywords: Large Language Models; Harmful Fine-Tuning Attacks
Abstract: Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors.
We propose **SPARD**, a defense framework that integrates **S**afety-**P**rojected **A**lternating optimization with **R**elevance–**D**iversity-aware data selection.
SPARD employs SPAG, which alternates between utility updates and explicit safety projections computed on a curated set of safe data, enforcing safety constraints throughout fine-tuning.
To curate this safe data, we introduce a Relevance–Diversity Determinantal Point Process that selects a compact safe subset, balancing task relevance with safety coverage.
Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy.
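To make the alternating safety-projection idea concrete, here is a minimal sketch of one plausible reading of the abstract: take a utility gradient step on the fine-tuning data, then approximately project the parameters back onto a safety-feasible set defined by a loss threshold on the curated safe data. This is an illustration under assumptions, not the paper's actual SPAG procedure; the names `compute_loss`, `utility_batches`, `safe_batch`, and the threshold `tau` are hypothetical.

```python
# A minimal sketch, assuming the defense alternates a utility update with an
# approximate projection onto {theta : safety_loss(theta) <= tau}, where the
# safety loss is evaluated on the curated safe data. All names below are
# illustrative; the paper's exact SPAG algorithm is not given in this abstract.
import torch

def approx_safety_projection(model, optimizer, safe_batch, compute_loss,
                             tau=0.5, max_steps=5):
    """Approximately project parameters back into the safety-feasible set by
    taking gradient steps on the safety loss while the constraint is violated."""
    for _ in range(max_steps):
        safety_loss = compute_loss(model, safe_batch)
        if safety_loss.item() <= tau:   # constraint already satisfied
            break
        optimizer.zero_grad()
        safety_loss.backward()
        optimizer.step()

def alternating_train(model, optimizer, utility_batches, safe_batch,
                      compute_loss, tau=0.5):
    """Alternate utility updates with explicit safety projections."""
    for batch in utility_batches:
        # Utility update on the (possibly harmful) fine-tuning data.
        optimizer.zero_grad()
        compute_loss(model, batch).backward()
        optimizer.step()
        # Explicit safety projection using the curated safe data.
        approx_safety_projection(model, optimizer, safe_batch, compute_loss, tau)
```

In this reading, the threshold `tau` and the number of projection steps trade off how strictly the safety constraint is enforced against how much the utility update is allowed to drift.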
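The Relevance–Diversity Determinantal Point Process can likewise be illustrated with the common quality–diversity kernel decomposition, in which a per-example relevance score weights an embedding-similarity matrix and a greedy MAP routine picks a compact, diverse subset. This is a hedged sketch only; the paper's actual kernel, relevance scores, and selection routine are not specified in this abstract, and the function and variable names below are assumptions.

```python
# A minimal sketch of relevance-diversity DPP selection, assuming the standard
# quality-diversity kernel L = diag(q) * S * diag(q): q_i scores how relevant
# safe example i is to the downstream task, and S holds cosine similarities
# between example embeddings (diversity). Greedy MAP selection of k items.
import numpy as np

def greedy_dpp_select(embeddings, relevance, k):
    """Greedily pick k indices that maximize the DPP log-determinant."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = emb @ emb.T                                  # similarity (diversity) term
    L = relevance[:, None] * S * relevance[None, :]  # relevance-weighted kernel
    selected, remaining = [], list(range(len(relevance)))
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: pick 10 of 100 candidate safe examples with 16-dim embeddings.
rng = np.random.default_rng(0)
subset = greedy_dpp_select(rng.normal(size=(100, 16)),
                           rng.uniform(0.1, 1.0, size=100), k=10)
```

The naive greedy loop recomputes a log-determinant per candidate and is only meant to show the selection criterion; practical DPP implementations use incremental Cholesky updates for efficiency.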
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12966