Research Area: Compute efficient LMs, Learning algorithms for LMs
Keywords: Mixture of Experts, Language Modeling, Language Model Pre-training
TL;DR: We introduce Lory, a fully-differentiable MoE architecture designed for autoregressive language model pre-training.
Abstract: Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space. Nevertheless, its effectiveness has only been demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, a novel approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models from scratch on 150B tokens, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models in both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve performance competitive with state-of-the-art MoE models that use token-level routing. We further demonstrate that the trained experts capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
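To make the two key ideas in the abstract concrete, the sketch below illustrates a fully-differentiable MoE FFN layer in which expert parameters are softly merged in parameter space (as in SMEAR) and routing weights for each segment are computed from the preceding segment, keeping the computation causal. This is a minimal illustration under stated assumptions, not the paper's implementation: the module and parameter names, tensor shapes, and the first-segment fallback are hypothetical choices made for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMergedMoEFFN(nn.Module):
    """Sketch: fully-differentiable MoE FFN with segment-level routing.

    Expert FFNs are merged in parameter space via a weighted average
    (SMEAR-style), so gradients flow through the router. Router weights
    for segment i are computed from segment i-1 to preserve causality.
    All names and the segment-0 fallback are illustrative assumptions.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert parameters stacked along the expert dimension E.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x: torch.Tensor, segment_len: int) -> torch.Tensor:
        # x: (batch, seq_len, d_model); assumes seq_len % segment_len == 0.
        b, t, d = x.shape
        segs = x.view(b, t // segment_len, segment_len, d)

        outputs = []
        for i in range(segs.size(1)):
            # Causal segment routing: gate segment i with the mean hidden
            # state of segment i-1 (segment 0 reuses its own mean here,
            # which is a simplification assumed for this sketch).
            ctx = segs[:, i - 1] if i > 0 else segs[:, 0]
            gate = F.softmax(self.router(ctx.mean(dim=1)), dim=-1)  # (b, E)

            # Soft merge: one weighted-average FFN per sequence; the merge
            # is differentiable with respect to the router weights.
            w_in = torch.einsum("be,edf->bdf", gate, self.w_in)
            w_out = torch.einsum("be,efd->bfd", gate, self.w_out)

            h = F.gelu(torch.einsum("bsd,bdf->bsf", segs[:, i], w_in))
            outputs.append(torch.einsum("bsf,bfd->bsd", h, w_out))

        return torch.cat(outputs, dim=1)
```

Because experts are merged once per segment rather than once per token, the number of merge operations scales with the segment count instead of the sequence length, which is the efficiency argument behind segment-level routing in the abstract.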
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 746