Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models
Keywords: Prefix Tuning, Safety
Abstract: We propose \textit{safety policy patching}, a lightweight and modular approach for addressing safety vulnerabilities in large language models (LLMs) between major releases. Major version updates are costly, infrequent, and difficult to tailor to customer needs, leaving deployed models with known safety gaps. Our method enables rapid remediation by prepending a compact, learnable prefix to an existing model's inputs. This patch introduces very few additional parameters---e.g., $0.003\%$ for LLaMA-2---yet reliably steers model behavior toward that of a safer reference model. Across three critical domains---toxicity mitigation, bias reduction, and harmfulness refusal---policy patches achieve safety improvements comparable to stronger safety-aligned models (e.g., future major releases) while preserving fluency. Overall, we show that LLMs can be ``patched'' much like software, providing vendors and practitioners with a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
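The core mechanism described in the abstract---a small set of learnable prefix embeddings prepended to a frozen model's inputs---can be sketched as follows. This is a minimal illustration of generic prefix tuning, not the paper's implementation; the class name `PrefixPatch`, the toy base model, and all dimensions are assumptions for the example.

```python
import torch
import torch.nn as nn

class PrefixPatch(nn.Module):
    """Hypothetical sketch: a learnable prefix ("patch") prepended to the
    input embeddings of a frozen base model. Only the prefix is trained."""

    def __init__(self, base_model: nn.Module, embed_dim: int, prefix_len: int = 8):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # freeze the base model; the patch is the only update
        # the patch itself: prefix_len virtual-token embeddings
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # prepend the prefix along the sequence dimension
        return self.base_model(torch.cat([prefix, input_embeds], dim=1))

# Toy stand-in for a frozen LLM layer (the real base would be e.g. LLaMA-2).
embed_dim = 16
base = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
patched = PrefixPatch(base, embed_dim=embed_dim, prefix_len=4)

x = torch.randn(2, 10, embed_dim)  # batch of 2 sequences, length 10
out = patched(x)                   # sequence grows by the prefix length: (2, 14, 16)
trainable = sum(p.numel() for p in patched.parameters() if p.requires_grad)
total = sum(p.numel() for p in patched.parameters())
```

Because only `prefix_len * embed_dim` parameters require gradients, the patch is a tiny fraction of the total parameter count, which is what makes such updates cheap to train, distribute, and swap between releases.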
Submission Number: 99