Abstract: The Mixture-of-Experts (MoE) architecture plays a crucial role in scaling Large Language Models to trillions of parameters without incurring excessive computational costs. Although utilizing these powerful models for Multi-Task Learning (MTL) is an attractive objective, naively training a single model on diverse tasks often degrades performance due to task interference and negative transfer. To tackle this issue, we propose Complexity-Aware Expert Merging (CAEM), a novel strategy that uses the entropy of expert utilization as an indicator of task complexity. This approach enables strategic allocation of expert resources and overcomes common MTL bottlenecks. Our method significantly outperforms standard MTL baselines, e.g., achieving a 6.47\% ROUGE-L gain on the complex XSum task with only negligible trade-offs on simpler tasks in the 8-expert setting. We attribute this gain to a superior starting point in the loss landscape and to path dependency in optimization, an observation that leads us to a more general principle: a "Founder Effect" in model merging. CAEM not only offers a resource-efficient path to high-performance MTL but also yields insights into the mechanisms of model merging.
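To make the complexity signal concrete, below is a minimal sketch of how an entropy-of-expert-utilization score could be computed from MoE router probabilities. The function name, tensor shapes, and normalization details are illustrative assumptions, not the paper's exact definition of CAEM's complexity measure.

```python
import torch

def expert_utilization_entropy(router_probs: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the average expert-utilization distribution.

    router_probs: (num_tokens, num_experts) softmax outputs of the MoE router
    over one task's tokens. Higher entropy means utilization is spread across
    many experts, which can be read as a signal of higher task complexity.
    """
    # Average routing probability per expert over all tokens of the task.
    utilization = router_probs.mean(dim=0)            # (num_experts,)
    utilization = utilization / utilization.sum()     # renormalize to a distribution
    # Shannon entropy (natural log); small epsilon avoids log(0).
    return -(utilization * torch.log(utilization + 1e-12)).sum()


# Example: 8 experts, 1000 tokens from a hypothetical task
probs = torch.softmax(torch.randn(1000, 8), dim=-1)
print(expert_utilization_entropy(probs))
```

Under this sketch, tasks whose tokens are routed broadly across experts receive a higher complexity score and could accordingly be allotted more expert capacity.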
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: multi-task learning, transfer learning / domain adaptation, representation learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 177