Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Margaret Li; Suchin Gururangan; Tim Dettmers; Mike Lewis; Tim Althoff; Noah A. Smith; Luke Zettlemoyer

15 Oct 2022 (modified: 14 Jul 2025), INTERPOLATE at NeurIPS 2022, Readers: Everyone

Abstract: We present Branch-Train-Merge (BTM), a communication-efficient algorithm for training language models (LMs). BTM learns a set of independent expert LMs (ELMs), each specialized to a different domain, such as scientific or legal text. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training on new domains, and then merging the resulting models back into the set for future use. These ELMs can be ensembled or averaged at inference time. Experiments show that BTM improves in- and out-of-domain perplexities compared to compute-matched GPT-style transformer LMs. Our results suggest that extreme parallelism could be used to efficiently scale LMs in future work.
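
The two inference-time options named in the abstract (averaging and ensembling) and the branch step can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat dictionary-of-lists parameter format, and the caller-supplied mixture weights are all assumptions made for illustration, and the abstract does not say how the weights are chosen.

```python
from typing import Dict, List, Sequence

# Hypothetical parameter format: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def _normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def average_parameters(experts: Sequence[Params], weights: Sequence[float]) -> Params:
    """Weighted parameter average; all experts must share names and shapes."""
    norm = _normalize(weights)
    merged: Params = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(size)
        ]
    return merged


def ensemble_probs(expert_probs: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Weighted mixture of the experts' next-token distributions."""
    norm = _normalize(weights)
    vocab_size = len(expert_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(norm, expert_probs))
        for i in range(vocab_size)
    ]


def branch(experts: Sequence[Params], weights: Sequence[float]) -> Params:
    """Assumed branch step: start a new expert from a mixture of existing ones."""
    return average_parameters(experts, weights)


def merge(expert_set: List[Params], new_expert: Params) -> List[Params]:
    """Assumed merge step: add the newly trained expert to the set."""
    return expert_set + [new_expert]
```

Averaging yields a single model whose inference cost matches one expert, while ensembling requires a forward pass through every expert in the mixture; the training step between `branch` and `merge` is omitted here because it is ordinary LM training on the new domain.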

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/branch-train-merge-embarrassingly-parallel/code)
