Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

15 Oct 2022 (modified: 21 Jul 2024), INTERPOLATE at NeurIPS 2022, Readers: Everyone
Abstract: We present Branch-Train-Merge (BTM), a communication-efficient algorithm for training language models (LMs). BTM learns a set of independent expert LMs (ELMs), each specialized to a different domain, such as scientific or legal text. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training on new domains, and then merging the resulting models back into the set for future use. These ELMs can be ensembled or averaged at inference time. Experiments show that BTM improves in- and out-of-domain perplexities compared to compute-matched GPT-style transformer LMs. Our results suggest that extreme parallelism could be used to efficiently scale LMs in future work.
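The branch, train, and merge steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Params` type, the function names, the `train_fn` callback, and the mixing weights are all hypothetical stand-ins, and the abstract does not say how branching or ensembling weights are chosen. Model weights are represented as plain lists of floats so the example runs without any ML library.

```python
from typing import Callable, Dict, List

# Toy stand-in for a model's weights: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def average_params(experts: List[Params], weights: List[float]) -> Params:
    """Weighted average of expert parameters (weights are normalized to sum to 1)."""
    total = sum(weights)
    norm = [w / total for w in weights]
    return {
        name: [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(len(experts[0][name]))
        ]
        for name in experts[0]
    }


def branch_train_merge(
    experts: List[Params],
    branch_weights: List[float],
    train_fn: Callable[[Params], Params],
) -> List[Params]:
    """One BTM-style iteration under the assumptions stated above."""
    seed = average_params(experts, branch_weights)  # branch from a mixture of experts
    new_expert = train_fn(seed)                     # train further on the new domain
    return experts + [new_expert]                   # merge back into the expert set


def ensemble_probs(expert_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Mix the experts' next-token distributions with normalized weights."""
    total = sum(weights)
    vocab_size = len(expert_probs[0])
    return [
        sum((w / total) * probs[i] for w, probs in zip(weights, expert_probs))
        for i in range(vocab_size)
    ]
```

In this sketch, `average_params` serves both for branching and for collapsing the expert set into a single averaged model at inference time, while `ensemble_probs` corresponds to the ensembling alternative. Each call to `train_fn` touches only the new expert, which is why separate experts could in principle be trained without communicating with one another.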