Abstract: The Mixture-of-Experts (MoE) architecture plays a crucial role in scaling Large Language Models to trillions of parameters without incurring excessive computational costs. Although utilizing these powerful models for Multi-Task Learning (MTL) is an attractive objective, naively training a single model on diverse tasks often degrades performance due to task interference and negative transfer. To tackle this issue, we propose Complexity-Aware Expert Merging (CAEM), a novel strategy that uses the entropy of expert utilization as an indicator of task complexity. This approach enables strategic allocation of expert resources and overcomes common MTL bottlenecks. Our method significantly outperforms standard MTL baselines, e.g., achieving a 6.47\% ROUGE-L gain on the complex XSum task with only negligible trade-offs on simpler tasks in the 8-expert setting. We attribute this gain to a superior starting point in the loss landscape and to path dependency in optimization, an observation that leads us to a more general principle: a "Founder Effect" in model merging. CAEM not only offers a resource-efficient path to high-performance MTL but also yields insights into the mechanisms of model merging.
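To make the complexity signal concrete, below is a minimal sketch of how an entropy-of-expert-utilization score could be computed from MoE router probabilities. The function name, tensor shapes, and normalization details are illustrative assumptions, not the paper's exact definition of CAEM's complexity measure.

```python
import torch

def expert_utilization_entropy(router_probs: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the average expert-utilization distribution.

    router_probs: (num_tokens, num_experts) softmax outputs of the MoE router
    over one task's tokens. Higher entropy means utilization is spread across
    many experts, which can be read as a signal of higher task complexity.
    """
    # Average routing probability per expert over all tokens of the task.
    utilization = router_probs.mean(dim=0)            # (num_experts,)
    utilization = utilization / utilization.sum()     # renormalize to a distribution
    # Shannon entropy (natural log); small epsilon avoids log(0).
    return -(utilization * torch.log(utilization + 1e-12)).sum()


# Example: 8 experts, 1000 tokens from a hypothetical task
probs = torch.softmax(torch.randn(1000, 8), dim=-1)
print(expert_utilization_entropy(probs))
```

Under this sketch, tasks whose tokens are routed broadly across experts receive a higher complexity score and could accordingly be allotted more expert capacity.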
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: multi-task learning, transfer learning / domain adaptation, representation learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 177