Keywords: Model merging, Data mixing, Alignment, Safety, Multilingual, Large Language Models
Abstract: Large Language Models (LLMs) are increasingly used worldwide across a broad variety of applications. However, ensuring their safe use remains a significant challenge. Preference training and safety measures often overfit to harms prevalent in Western-centric datasets, and safety protocols frequently fail to extend to multilingual settings. In this work, we explore merging models trained on diverse safety data as a method to enhance safety across languages, comparing it against data mixing strategies.
We observe substantial gains from merging, with improvements in safety and general performance across six languages of up to 10% and 8%, respectively. We also extend the multilingual coverage of models by combining monolingual models, yielding approximately 7% improvement in safety and 4% in general performance. Our experiments show that not all merging algorithms consistently yield improvements, particularly in balancing the dual objectives of safety and general performance in a multilingual context. Overall, our comparison reveals that model merging generally outperforms data mixing in balancing safety and general performance.
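For readers unfamiliar with model merging, the sketch below shows one common algorithm: uniform parameter-wise weight averaging (a "model soup"). This is only an illustrative assumption, not necessarily the merging method used in the paper; the checkpoint names, the `merge_state_dicts` helper, and the uniform weights are all hypothetical placeholders.

```python
# Minimal sketch of linear weight merging, assuming two fine-tuned variants
# of the same base architecture (e.g. a general model and a safety-tuned one).
# Checkpoint names below are placeholders, not the paper's actual models.
import torch
from transformers import AutoModelForCausalLM

def merge_state_dicts(state_dicts, weights):
    """Return a parameter-wise weighted average of the given state dicts."""
    merged = {}
    for key in state_dicts[0]:
        tensors = [sd[key] for sd in state_dicts]
        if tensors[0].is_floating_point():
            merged[key] = sum(w * t for t, w in zip(tensors, weights))
        else:
            # Integer buffers (e.g. position ids) are identical across
            # checkpoints of the same architecture; copy them as-is.
            merged[key] = tensors[0]
    return merged

models = [AutoModelForCausalLM.from_pretrained(name)
          for name in ("org/general-model", "org/safety-tuned-model")]
weights = [0.5, 0.5]  # uniform averaging; algorithms such as TIES or DARE differ here

merged_sd = merge_state_dicts([m.state_dict() for m in models], weights)
models[0].load_state_dict(merged_sd)  # reuse the first model's architecture
```

Unlike data mixing, which retrains a single model on a blended dataset, this approach combines already-trained checkpoints post hoc; the paper's comparison concerns these two strategies.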
Submission Number: 204