AlignGuard-LoRA: Alignment-Preserving Fine-Tuning via Fisher-Guided Decomposition and Riemannian-Geodesic Collision Regularization

ACL ARR 2025 July Submission389 Authors

27 Jul 2025 (modified: 04 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Low-rank adaptation (LoRA) has become a standard tool for efficiently fine-tuning large language models (LLMs). Yet, even minor LoRA updates can induce *alignment drift*, weakening safety and behavioral constraints through entangled parameter changes. To address this, we propose **AlignGuard-LoRA (AGL)**, a principled framework for preserving alignment during fine-tuning. **AGL** introduces several key components: a primary task loss for supervision, **Fisher Information Matrix-based regularization** to restrict updates in alignment-sensitive subspaces, and **task-specific regularization** to stabilize the integration of new knowledge. We further introduce **collision-aware regularization**, blending **Riemannian overlap**, which penalizes coordinate-wise interference, with **geodesic separation**, which encourages disjoint update geometry. We curate **DriftCaps**, a targeted diagnostic benchmark of safe and unsafe prompts designed to quantify alignment drift and safety degradation. Empirical evaluations show that **AGL** mitigates alignment drift by up to **50%** on safety-critical benchmarks without degrading downstream task performance. Comprehensive ablations confirm that each component contributes distinctly to preserving latent safety behaviors. Finally, we derive and validate a **scaling law for catastrophic forgetting**, revealing that **AGL** flattens post-finetuning loss escalation while preserving adaptation dynamics. **AGL** is a structurally grounded refinement of LoRA, ensuring alignment preservation with minimal trade-offs. To encourage further exploration and development, we open-source our implementation.
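The abstract describes a composite training objective: the primary task loss plus a Fisher-guided penalty on alignment-sensitive directions, a task-specific stabilizer, and a collision-aware term blending Riemannian overlap with geodesic separation. The PyTorch sketch below illustrates one way such an objective *could* be assembled; the function name `agl_loss`, the `lambda_*` weights, the diagonal-Fisher quadratic penalty, the elementwise overlap term, and the cosine-based separation term are all illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of a composite AGL-style objective (assumptions noted inline).
import torch
import torch.nn.functional as F


def agl_loss(
    task_loss: torch.Tensor,    # primary supervision loss
    delta_align: torch.Tensor,  # update component along alignment-sensitive directions
    delta_task: torch.Tensor,   # update component carrying new task knowledge
    fisher_diag: torch.Tensor,  # diagonal Fisher estimate for alignment-sensitive params (assumed form)
    lambda_f: float = 1.0,      # weight on Fisher-guided penalty (assumed)
    lambda_t: float = 0.1,      # weight on task-specific stabilizer (assumed)
    lambda_c: float = 0.1,      # weight on collision-aware term (assumed)
    mix: float = 0.5,           # blend between overlap and separation terms (assumed)
) -> torch.Tensor:
    # Fisher-guided penalty: discourage movement along alignment-sensitive directions.
    fisher_pen = (fisher_diag * delta_align.pow(2)).sum()

    # Task-specific stabilizer: here simply an L2 penalty on the task-side update.
    task_pen = delta_task.pow(2).sum()

    # "Riemannian overlap": penalize coordinate-wise interference between components.
    overlap = (delta_align * delta_task).abs().sum()

    # "Geodesic separation": push the two update directions toward orthogonality
    # (squared cosine similarity is 0 when the directions are disjoint).
    cos = F.cosine_similarity(delta_align.flatten(), delta_task.flatten(), dim=0)
    separation = cos.pow(2)

    collision = mix * overlap + (1.0 - mix) * separation
    return task_loss + lambda_f * fisher_pen + lambda_t * task_pen + lambda_c * collision
```

In this sketch the two update components would come from some decomposition of the LoRA delta (e.g., a projection onto Fisher eigendirections); how that split is actually computed is specified in the paper, not here.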
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Alignment Preservation, Safety-Aware Fine-Tuning, Catastrophic Forgetting & Scaling Laws, Diagnostic Benchmarking
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 2: DRIFTCHECK: Diagnosing Alignment Drift
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Mentioned in the Abstract: To encourage further exploration and development, we open-source our implementation at https://anonymous.4open.science/r/alignguard-1056/.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 2: DRIFTCHECK: Diagnosing Alignment Drift
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 2: DRIFTCHECK: Diagnosing Alignment Drift
B6 Statistics For Data: Yes
B6 Elaboration: Section 2: DRIFTCHECK: Diagnosing Alignment Drift
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix E: Implementation and Hyperparameter Tuning
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix E: Implementation and Hyperparameter Tuning
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5: Performance of ALIGNGUARD-LORA
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix E: Implementation and Hyperparameter Tuning
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 389