Abstract: We present Branch-Train-Merge (BTM), a communication-efficient algorithm for training language models (LMs). BTM learns a set of independent expert LMs (ELMs), each specialized to a different domain, such as scientific or legal text. New ELMs are learned by branching from (mixtures of) ELMs in the current set, training further on a new domain, and then merging the resulting model back into the set for future use. At inference time, these ELMs can be ensembled or averaged. Experiments show that BTM improves in- and out-of-domain perplexities compared to compute-matched GPT-style transformer LMs. Our results suggest that extreme parallelism could be used to scale LMs efficiently in future work.
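
The branch, train, and merge steps described in the abstract can be sketched as a toy example in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are plain lists of floats, "averaged" is read as a weighted average of parameters, "ensembled" is read as a weighted mixture of next-token distributions, and the mixture weights are simply supplied by the caller. The names `average_params`, `ensemble_probs`, `branch_train_merge`, and `train_fn` are hypothetical.

```python
from typing import Callable, Dict, List

# Toy stand-in for model weights: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def average_params(experts: List[Params], weights: List[float]) -> Params:
    """Weighted average of expert parameters (weights are normalized here)."""
    total = sum(weights)
    norm = [w / total for w in weights]
    merged: Params = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(size)
        ]
    return merged


def ensemble_probs(expert_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted mixture of per-expert next-token distributions."""
    total = sum(weights)
    vocab_size = len(expert_probs[0])
    return [
        sum(w / total * probs[i] for w, probs in zip(weights, expert_probs))
        for i in range(vocab_size)
    ]


def branch_train_merge(
    expert_set: List[Params],
    weights: List[float],
    train_fn: Callable[[Params], Params],
) -> List[Params]:
    """One BTM-style iteration over an existing expert set.

    train_fn is a hypothetical callback that trains a copy of the seed
    parameters on the new domain and returns the trained parameters.
    """
    seed = average_params(expert_set, weights)  # branch from a mixture of experts
    new_expert = train_fn(seed)                 # train on the new domain
    return expert_set + [new_expert]            # merge back into the set
```

Because each call to `train_fn` touches only its own copy of the seed parameters, several new experts could in principle be trained at once without exchanging gradients, which is the sense in which this sketch is communication-efficient.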