Abstract: The safety alignment of pre-trained LLMs continues to attract attention from both industry and academic research. This paper presents H\textsuperscript{3}Fusion, a mixture-of-experts (MoE) fusion approach to optimize safety alignment performance with three unique characteristics: (1) H\textsuperscript{3}Fusion creates a robust alignment by integrating multiple LLMs independently aligned for helpfulness, harmlessness, and honesty, respectively, enabling fusion-enhanced capabilities beyond each individual model. (2) H\textsuperscript{3}Fusion develops a mixture-of-experts based fusion methodology with two unique features: we first freeze the multi-head attention weights of each individual model while tuning the feed-forward network (FFN) layer during alignment fusion; we then merge the aligned model weights with an expert router that, according to the type of input instruction, dynamically selects the subset of experts best suited for producing the output response. (3) H\textsuperscript{3}Fusion introduces a gating loss and regularization terms to further boost the performance of the resulting H\textsuperscript{3}Fusion model. Extensive evaluations on three benchmark datasets show that H\textsuperscript{3}Fusion is more helpful, less harmful, and more honest in two aspects: it outperforms each individually aligned model by $11.37\%$ and provides stronger robustness than state-of-the-art LLM ensemble approaches by $13.77\%$. Code is available at {\small \url{https://anonymous.4open.science/r/h3fusion-F45E/}}.
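The expert-routing step described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): a gating network scores three alignment experts (helpful, harmless, honest) for each token, routes the token to the top-k experts' FFN outputs, and computes a simple load-balancing gating loss; all names, shapes, and the particular regularizer are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(hidden, gate_w, expert_ffns, k=2):
    """Mix the top-k experts' FFN outputs per token, weighted by the gate."""
    probs = softmax(hidden @ gate_w)              # (tokens, n_experts)
    topk = np.argsort(-probs, axis=-1)[:, :k]     # indices of selected experts
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        sel = topk[t]
        w = probs[t, sel] / probs[t, sel].sum()   # renormalize over selected
        for wi, e in zip(w, sel):
            out[t] += wi * expert_ffns[e](hidden[t])
    # Illustrative gating (load-balancing) loss: penalize deviation of the
    # mean gate probability per expert from the uniform distribution.
    gate_loss = np.square(probs.mean(axis=0) - 1.0 / probs.shape[1]).sum()
    return out, gate_loss

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))              # 4 tokens, hidden dim 8
gate_w = rng.standard_normal((8, 3))              # router weights, 3 experts
# Stand-in FFN experts: each a fixed random linear map.
experts = [lambda h, W=rng.standard_normal((8, 8)): h @ W for _ in range(3)]
out, gloss = route(hidden, gate_w, experts, k=2)
print(out.shape)
```

In an actual MoE fusion layer the attention weights would stay frozen and only the expert FFNs and router would be trained, with the gating loss added to the task objective.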
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, fine-tuning, robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings - efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 2106