Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Published: 10 Oct 2024, Last Modified: 19 Nov 2024, AFM 2024 Poster, CC BY 4.0
Keywords: alignment, efficiency, model merging, model interpolation, safety
TL;DR: An efficient and effective method to align domain expert models for safety without compromising their utility
Abstract: There is growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, domain fine-tuning often degrades the safety alignment of these expert models, leaving them capable of generating harmful content. As a solution, we introduce MergeAlign, an efficient and effective merging-based alignment method that interpolates the domain and alignment vectors, producing safer domain-specific models while preserving their utility. We apply MergeAlign to Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We further study the impact of model merging through model similarity metrics and the contributions of the individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
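For intuition, the sketch below shows one way to interpolate a domain vector and an alignment vector on top of a shared base model, which is the general idea the abstract describes. The function name merge_align, the state-dict interface, and the single mixing coefficient alpha are illustrative assumptions for this sketch, not the paper's exact formulation or weighting scheme.

    import torch

    def merge_align(base_sd, domain_sd, aligned_sd, alpha=0.5):
        """Interpolate domain and alignment task vectors over a shared base model.

        base_sd / domain_sd / aligned_sd are state dicts with identical keys.
        alpha weights the domain vector; (1 - alpha) weights the alignment vector.
        (This weighting is an assumption for illustration only.)
        """
        merged = {}
        for name, base_w in base_sd.items():
            domain_vec = domain_sd[name] - base_w    # "domain vector": expert minus base
            align_vec = aligned_sd[name] - base_w    # "alignment vector": aligned minus base
            merged[name] = base_w + alpha * domain_vec + (1.0 - alpha) * align_vec
        return merged

    # Toy usage with random tensors standing in for model weights.
    base = {"w": torch.randn(4, 4)}
    domain = {"w": base["w"] + 0.1 * torch.randn(4, 4)}
    aligned = {"w": base["w"] + 0.1 * torch.randn(4, 4)}
    merged = merge_align(base, domain, aligned, alpha=0.5)

In practice the same loop would run over the full state dicts of a base model, its domain-tuned expert, and its instruction-aligned counterpart; only the interpolation idea is taken from the abstract.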
Submission Number: 132