Keywords: LLM Unlearning, Relearning Attacks, Constrained Fine-tuning
Abstract: Large language models are vulnerable to relearning attacks, in which adversaries exploit fine-tuning to restore knowledge that was intentionally removed by unlearning procedures. Current post-hoc safety evaluations detect violations only after fine-tuning has completed, creating security gaps and computational waste.
We introduce a safety-constrained fine-tuning framework that proactively prevents relearning attacks by formulating the defense as a constrained optimization problem: legitimate fine-tuning objectives are optimized subject to explicit constraints that prevent restoration of forgotten knowledge. We present an efficient *Constraint-Aware Gradient Descent* algorithm that replaces the intractable nonlinear constraints with first-order Taylor approximations, yielding convex quadratic subproblems with closed-form solutions.
Comprehensive experiments on Llama models demonstrate robust defense against relearning attack scenarios while maintaining legitimate fine-tuning performance.
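A minimal sketch of the constrained-update idea described in the abstract, under assumed details: the constraint is taken to be "do not reduce the forget-set loss by more than a slack term," it is linearized around the current parameters, and the resulting convex quadratic subproblem is solved in closed form by projecting the descent step onto the linearized constraint's half-space. Function and variable names are illustrative, not the authors' implementation.

```python
import torch

def constraint_aware_step(task_grad: torch.Tensor,
                          forget_grad: torch.Tensor,
                          lr: float = 1e-4,
                          slack: float = 0.0) -> torch.Tensor:
    """Closed-form solution of a linearized constrained subproblem (sketch).

    Solves   min_d  ||d + lr * task_grad||^2
             s.t.   forget_grad . d >= -slack
    i.e., take a step as close as possible to plain gradient descent on the
    fine-tuning loss while, to first order, not decreasing the forget-set
    loss by more than `slack`.
    """
    d = -lr * task_grad                       # unconstrained descent step
    gap = forget_grad.dot(d) + slack          # linearized constraint residual
    if gap < 0:                               # constraint violated -> project
        # Projection onto the half-space {d : forget_grad . d >= -slack}
        denom = forget_grad.dot(forget_grad).clamp_min(1e-12)
        d = d - (gap / denom) * forget_grad
    return d

# Usage sketch: flatten per-parameter gradients of the fine-tuning loss and the
# forget-set loss, compute the constrained step, then scatter it back.
# task_grad   = torch.cat([p.grad.flatten() for p in model.parameters()])
# forget_grad = torch.cat([g.flatten() for g in forget_loss_grads])
# step        = constraint_aware_step(task_grad, forget_grad, lr=1e-4)
```

Because the linearized feasible set is a half-space, the projection is a single inner product and rescaling per step, which is what makes the subproblem cheap relative to solving the original nonlinear constraint.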
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22438