Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Published: 01 Feb 2023, Last Modified: 13 Feb 2023. Submitted to ICLR 2023.
Keywords: sparsity, language model, efficient
TL;DR: We develop a new class of large language models that is embarrassingly parallel: different parts of the model are independently trained on different subsets of the data, with no need for multi-node training or inference.
Abstract: We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent Expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training on new domains, and then merging the resulting models back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; ensembles with random data splits do not perform well. Our results suggest that aggressive parallelism could be used to efficiently scale larger LMs in future work.
Area: Applications (e.g., speech processing, computer vision, NLP)
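
The abstract describes collapsing a set of domain-specialized ELMs back into a single LM by averaging their parameters for efficient inference. Below is a minimal illustrative sketch of that merge step, not the authors' implementation: it assumes every ELM shares the same architecture, and the function name, checkpoint paths, and uniform weights are hypothetical.

```python
# Sketch: merge independently trained Expert LMs (ELMs) by parameter averaging.
# Assumes all experts share one architecture (identical state-dict keys/shapes).
import torch

def average_experts(expert_state_dicts, weights=None):
    """Return a state dict that is the weighted average of the expert state dicts."""
    n = len(expert_state_dicts)
    if weights is None:
        # Uniform merge; non-uniform weights (e.g., domain-relevance scores) also fit here.
        weights = [1.0 / n] * n
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(
            w * sd[name].to(torch.float32)
            for w, sd in zip(weights, expert_state_dicts)
        )
    return merged

# Hypothetical usage: load two domain experts and serve a single averaged model.
# science_sd = torch.load("elm_science.pt")
# legal_sd = torch.load("elm_legal.pt")
# single_lm.load_state_dict(average_experts([science_sd, legal_sd]))
```

The same set of experts could instead be kept separate and ensembled at inference time (mixing their output distributions), which the abstract notes is the option that generalizes to new domains; averaging trades some of that flexibility for single-model inference cost.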