CAMEx: Curvature-aware Merging of Experts

Published: 06 Mar 2025, Last Modified: 04 Apr 2025 · MCDC @ ICLR 2025 · License: CC BY 4.0
Keywords: Sparse Mixture-of-Experts, efficiency, expert merging
TL;DR: We introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold.
Abstract: Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are twofold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models.
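The abstract describes, but does not spell out, the curvature-aware merging update. Below is a minimal PyTorch sketch of the general idea: each expert contributes a task vector relative to a shared base, and the update is rescaled by a curvature term before merging. Here a scalar root-mean-square proxy stands in for a Fisher-based natural-gradient preconditioner; the function name curvature_aware_merge and the parameters scores, alpha, and eps are hypothetical and not taken from the paper.

```python
import torch

def curvature_aware_merge(base, experts, scores, alpha=0.1, eps=1e-8):
    """Merge expert weights into a shared base with a curvature-scaled step.

    base:    tensor of shared/merged parameters
    experts: list of expert tensors, each the same shape as `base`
    scores:  per-expert importance weights (e.g., averaged router gates)

    The scalar curvature proxy below is an illustrative stand-in for a
    Fisher Information Matrix approximation; CAMEx's actual natural-gradient
    estimator is not specified in this abstract.
    """
    merged = base.clone()
    for w, s in zip(experts, scores):
        tau = w - base  # Euclidean task vector for this expert
        # Scalar curvature proxy: RMS magnitude of the task vector,
        # standing in for a Fisher-based preconditioner.
        curvature = tau.pow(2).mean().clamp_min(eps)
        merged = merged + alpha * s * tau / curvature.sqrt()
    return merged

# Toy usage: merge four perturbed copies of a base expert.
base = torch.randn(8, 8)
experts = [base + 0.01 * torch.randn(8, 8) for _ in range(4)]
scores = torch.softmax(torch.randn(4), dim=0)
merged = curvature_aware_merge(base, experts, scores)
```

Dropping the curvature term recovers plain Euclidean task-vector merging of experts; the curvature scaling is the piece that, per the abstract, aligns the merged update with the geometry of the parameter manifold.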
Submission Number: 6