Abstract: The safety alignment of pre-trained LLMs continues to attract attention from both industry and academic research. This paper presents H\textsuperscript{3}Fusion, a mixture-of-experts (MoE) fusion approach to optimize safety alignment performance with three unique characteristics: (1) H\textsuperscript{3}Fusion creates a robust alignment by integrating multiple LLMs independently aligned for helpfulness, harmlessness, and honesty, respectively, enabling fusion-enhanced capabilities beyond each individual model. (2) H\textsuperscript{3}Fusion develops a mixture-of-experts based fusion methodology with two unique features: we first freeze the multi-head attention weights of each individual model while tuning the feed-forward network (FFN) layer during alignment fusion; we then merge the aligned model weights with an expert router that, according to the type of input instruction, dynamically selects the subset of experts best suited for producing the output response. (3) H\textsuperscript{3}Fusion introduces a gating loss and regularization terms to further boost the performance of the resulting H\textsuperscript{3}Fusion model. Extensive evaluations on three benchmark datasets show that H\textsuperscript{3}Fusion is more helpful, less harmful, and more honest in two aspects: it outperforms each individually aligned model by $11.37\%$ and provides stronger robustness than state-of-the-art LLM ensemble approaches by $13.77\%$. Code is available at {\small \url{https://anonymous.4open.science/r/h3fusion-F45E/}}.
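The expert-routing step described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): a gating network scores three alignment experts (helpful, harmless, honest) for each token, routes the token to the top-k experts' FFN outputs, and computes a simple load-balancing gating loss; all names, shapes, and the particular regularizer are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(hidden, gate_w, expert_ffns, k=2):
    """Mix the top-k experts' FFN outputs per token, weighted by the gate."""
    probs = softmax(hidden @ gate_w)              # (tokens, n_experts)
    topk = np.argsort(-probs, axis=-1)[:, :k]     # indices of selected experts
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        sel = topk[t]
        w = probs[t, sel] / probs[t, sel].sum()   # renormalize over selected
        for wi, e in zip(w, sel):
            out[t] += wi * expert_ffns[e](hidden[t])
    # Illustrative gating (load-balancing) loss: penalize deviation of the
    # mean gate probability per expert from the uniform distribution.
    gate_loss = np.square(probs.mean(axis=0) - 1.0 / probs.shape[1]).sum()
    return out, gate_loss

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))              # 4 tokens, hidden dim 8
gate_w = rng.standard_normal((8, 3))              # router weights, 3 experts
# Stand-in FFN experts: each a fixed random linear map.
experts = [lambda h, W=rng.standard_normal((8, 8)): h @ W for _ in range(3)]
out, gloss = route(hidden, gate_w, experts, k=2)
print(out.shape)
```

In an actual MoE fusion layer the attention weights would stay frozen and only the expert FFNs and router would be trained, with the gating loss added to the task objective.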
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, fine-tuning, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings - efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 2106