BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Published: 21 Jun 2024, Last Modified: 26 Jul 2024, ES-FoMo-II 2024 Poster, CC BY 4.0
Keywords: large language models, mixture of experts
Abstract: Training Mixture of Experts (MoEs) from scratch in a large-scale regime is expensive. Previous work addresses this challenge by independently training multiple dense expert models and using them to initialize an MoE; in particular, the MoE layers are initialized from the experts' feed-forward parameters while all other parameters are merged. This limits the advantages of the specialized dense models when ``upcycling'' them into an MoE. We propose BAM (Branch-Attend-Mix), a simple yet effective improvement to MoE training. BAM makes full use of the specialized dense models by not only using their feed-forward network (FFN) parameters to initialize the MoE layers but also fully leveraging the experts' attention weights by initializing them as Mixture of Attention (MoA) layers. Our experiments with seed models ranging from 590 million to 2 billion parameters show that our approach outperforms state-of-the-art approaches under the same data and compute budget in both perplexity and downstream task evaluations.
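The sketch below illustrates the upcycling idea described in the abstract: each dense seed model contributes both its FFN and its attention weights to one expert in the upcycled block. It is a minimal illustration, not the authors' implementation; the DenseBlock and BAMBlock module names, the shared sequence-level top-1 router, and the simplified residual wiring (no layer norms, no token-level routing) are all assumptions made for brevity.

```python
# Minimal sketch of BAM-style upcycling (assumptions: hypothetical DenseBlock /
# BAMBlock modules, a shared top-1 router, sequence-level routing for simplicity).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseBlock(nn.Module):
    """One transformer block of a dense seed model (attention + FFN)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))


class BAMBlock(nn.Module):
    """Upcycled block: mixture over attention experts (MoA) and FFN experts (MoE)."""
    def __init__(self, dense_experts: list, d_model: int):
        super().__init__()
        # Branch-Attend-Mix: copy all specialized parameters, not just the FFNs.
        self.attn_experts = nn.ModuleList([copy.deepcopy(b.attn) for b in dense_experts])
        self.ffn_experts = nn.ModuleList([copy.deepcopy(b.ffn) for b in dense_experts])
        self.router = nn.Linear(d_model, len(dense_experts))  # shared router (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each sequence to one expert pair (real MoEs route per token).
        scores = F.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, n_experts)
        idx = scores.argmax(dim=-1)                             # (batch,)
        out = torch.empty_like(x)
        for e in idx.unique():
            e = int(e)
            sel = idx == e
            h, _ = self.attn_experts[e](x[sel], x[sel], x[sel])       # MoA expert
            out[sel] = x[sel] + h + self.ffn_experts[e](x[sel] + h)   # MoE FFN expert
        return out


if __name__ == "__main__":
    seeds = [DenseBlock(d_model=64, n_heads=4, d_ff=256) for _ in range(4)]
    block = BAMBlock(seeds, d_model=64)
    print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

The contrast with prior FFN-only upcycling would be to keep only `ffn_experts` per expert and merge (e.g., average) the seed models' attention weights into a single shared attention module; BAM instead keeps one attention expert per seed model so the specialized attention parameters are not lost.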
Submission Number: 14