Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs
Keywords: safety alignment, task similarity, guardrail durability
Abstract: Through supervised fine-tuning and reinforcement learning from human feedback, large language models can attain a certain level of safety alignment during instruction tuning. However, these *safety guardrails* are often fragile: models can easily be made to generate harmful content after downstream fine-tuning. Although various methods have been proposed to mitigate this, our paper shifts the focus to the durability of safety guardrails, beginning with their formation in the upstream alignment stage. The central question we explore is: *Can we construct more durable safety guardrails for specific downstream tasks, so that models remain safe after fine-tuning?* Our experiments demonstrate that guardrail durability is closely tied to the similarity between the upstream alignment and downstream fine-tuning datasets: higher similarity yields more fragile guardrails after fine-tuning, whereas lower similarity yields more durable ones. This finding highlights the importance of both diversity and privacy in upstream alignment data. Diversifying the alignment dataset makes downstream datasets less similar to it, which enhances guardrail durability under fine-tuning; keeping it private prevents adversaries from exploiting knowledge of the alignment data. We therefore advocate a dual strategy: prioritize both the privacy and the diversity of upstream alignment datasets to fortify safety guardrails against potential threats and ensure long-term model robustness in real-world applications.
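The abstract's central quantity, the similarity between upstream alignment data and downstream fine-tuning data, can be operationalized in several ways; the abstract does not state which metric the paper uses. The sketch below is a minimal illustration of one common proxy, cosine similarity between sentence-embedding centroids of the two corpora. The embedding model name and the centroid-based comparison are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumed metric, not the paper's): score how similar an
# upstream alignment corpus is to a downstream fine-tuning corpus via the
# cosine similarity of their sentence-embedding centroids.
import numpy as np
from sentence_transformers import SentenceTransformer

def dataset_similarity(upstream_texts: list[str],
                       downstream_texts: list[str],
                       model_name: str = "all-MiniLM-L6-v2") -> float:
    """Return cosine similarity between the embedding centroids of two corpora."""
    model = SentenceTransformer(model_name)  # model choice is an assumption
    up = model.encode(upstream_texts, normalize_embeddings=True)
    down = model.encode(downstream_texts, normalize_embeddings=True)
    # Average each corpus into a centroid, then re-normalize so the
    # dot product below is a proper cosine similarity in [-1, 1].
    up_c = up.mean(axis=0)
    down_c = down.mean(axis=0)
    up_c /= np.linalg.norm(up_c)
    down_c /= np.linalg.norm(down_c)
    return float(up_c @ down_c)

# Under the paper's finding, a higher score would predict more fragile
# guardrails after fine-tuning on the downstream set:
# sim = dataset_similarity(alignment_prompts, downstream_prompts)
```

A centroid comparison is deliberately coarse; per-example nearest-neighbor similarity or distributional distances would be natural alternatives under the same embedding setup.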
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 950