Enhanced Expert Merging for Mixture-of-Experts in Graph Foundation Models

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Graph Representation Learning, Graph Foundation Model, Mixture of Experts, Expert Merging
TL;DR: Efficient expert merging for MoE-based graph foundation models that approaches ensemble-level performance.
Abstract: Graph foundation models (GFMs) have emerged as a promising paradigm for learning transferable knowledge across diverse graph-structured data. The inherent heterogeneity in features and graph structures poses significant challenges for building scalable and generalizable GFMs. Existing research has employed mixture-of-experts (MoE) models to address these challenges, assigning the most suitable expert to each graph. Nevertheless, the underlying mechanisms of MoE in the context of GFMs remain insufficiently explored. In this work, we conduct an in-depth experimental study on an MoE-based GFM and uncover an intriguing finding: the experts ranked second and third by the router outperform the top-ranked expert. This insight motivates us to investigate how knowledge embedded across multiple experts can be leveraged. However, directly ensembling the outputs of multiple experts incurs substantial computational overhead, while a standard expert merging strategy risks suboptimal performance. To address these challenges, we introduce two enhanced expert merging strategies that retain the computational efficiency of expert merging while approaching the effectiveness of expert ensembling. Specifically, we propose (i) a knowledge distillation-inspired expert merging method that aligns the behavior of parameter-fused experts with expert ensembles, and (ii) a theoretically motivated parameter-proximity approach that leverages the similarity of expert parameters to approximate ensemble outputs while preserving diversity. Extensive experiments demonstrate that our methods effectively enhance model performance.
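To make the contrast in the abstract concrete, below is a minimal, illustrative sketch (not the paper's implementation) of the three inference paths it discusses: ensembling the outputs of the top-k routed experts, merging the selected experts' parameters into a single expert for one forward pass, and a knowledge-distillation-style loss that pushes the merged expert's output toward the (detached) ensemble output. The class and function names (`SimpleMoE`, `kd_alignment_loss`), the pooled routing, and the small MLP experts are assumptions made for brevity; the paper's experts operate on graphs.

```python
# Illustrative sketch only: contrasts top-k expert ensembling, parameter-space
# expert merging, and a KD-style alignment loss. Names and expert architecture
# are hypothetical, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call


class SimpleMoE(nn.Module):
    def __init__(self, dim: int, hidden: int = 32, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        # Small nonlinear experts; for purely linear experts, parameter
        # averaging and output averaging would coincide exactly.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def route(self, x: torch.Tensor, k: int = 3):
        # Pooled (batch-level) routing for simplicity: softmax scores, top-k.
        scores = F.softmax(self.router(x), dim=-1).mean(dim=0)  # (num_experts,)
        topk = torch.topk(scores, k)
        weights = topk.values / topk.values.sum()               # renormalize
        return topk.indices.tolist(), weights

    def ensemble_forward(self, x: torch.Tensor, k: int = 3):
        # Expensive path: run every selected expert, average their outputs.
        idx, w = self.route(x, k)
        outs = torch.stack([self.experts[i](x) for i in idx])   # (k, B, D)
        return (w.view(-1, 1, 1) * outs).sum(dim=0)

    def merged_forward(self, x: torch.Tensor, k: int = 3):
        # Cheap path: fuse the selected experts' parameters (weighted average)
        # and do a single forward pass through the merged parameter set.
        idx, w = self.route(x, k)
        selected = [self.experts[i] for i in idx]
        merged_params = {
            name: sum(wi * dict(e.named_parameters())[name]
                      for wi, e in zip(w, selected))
            for name, _ in selected[0].named_parameters()
        }
        return functional_call(selected[0], merged_params, (x,))


def kd_alignment_loss(moe: SimpleMoE, x: torch.Tensor, k: int = 3):
    # Distillation-flavoured objective: align the merged expert's output with
    # the detached ensemble output, so merging approaches ensembling quality.
    with torch.no_grad():
        teacher = moe.ensemble_forward(x, k)
    student = moe.merged_forward(x, k)
    return F.mse_loss(student, teacher)


if __name__ == "__main__":
    moe = SimpleMoE(dim=16)
    x = torch.randn(8, 16)
    print("ensemble-vs-merged alignment loss:", kd_alignment_loss(moe, x).item())
```

In this sketch the merged parameters are built differentiably, so minimizing the alignment loss updates the original experts (and, through the routing weights, the router); the paper's actual merging and distillation objectives may differ.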
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 20431