Keywords: Language Model
Abstract: We introduce FlexOLMo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on private datasets, and (2) data-flexible inference, where these parameters, along with their associated data, can be easily included or excluded from model inference with no further training. FlexOLMo employs a mixture-of-experts (MoE) architecture in which each expert is trained independently on private datasets and later integrated through a new nonparametric routing scheme without any joint training across datasets. FlexOLMo is trained on FLEXMIX, a corpus we curate comprising seven restricted sets, either real or realistic approximations, alongside publicly available datasets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, benefiting significantly from these restricted sets (an average 41% relative improvement) while allowing flexible opt-out at inference time (e.g., for users without appropriate licenses or permissions). Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions with the same training FLOPs. Altogether, FlexOLMo enables training on restricted data while keeping data local and supports fine-grained control of data access at inference.
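The abstract's core mechanism is an MoE layer whose expert feed-forward networks are trained independently by different data owners and then mixed at inference through a frozen, nonparametric router, with per-expert opt-out and no joint training. Below is a minimal PyTorch-style sketch of that general idea; the module names (FeedForward, FlexMoELayer), the similarity-based router embeddings, and the top-k mixing are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of combining independently trained
# expert FFNs with a frozen, nonparametric router and inference-time opt-out.
from __future__ import annotations

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """A standard transformer FFN; each data owner trains one independently."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class FlexMoELayer(nn.Module):
    """Mix independently trained expert FFNs without joint training.

    The router is nonparametric: each expert is represented by a fixed
    embedding (e.g., derived from its training data), and tokens are routed
    by similarity to those embeddings. Experts can be excluded at inference
    time via `active_experts` to opt their data out.
    """

    def __init__(self, experts: list[nn.Module], router_embeddings: torch.Tensor, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        # Frozen router weights: no gradient updates, no joint training across datasets.
        self.register_buffer("router_embeddings", router_embeddings)  # [n_experts, d_model]
        self.top_k = top_k

    def forward(self, x: torch.Tensor, active_experts: list[int] | None = None) -> torch.Tensor:
        # x: [batch, seq, d_model] -> routing scores: [batch, seq, n_experts]
        logits = x @ self.router_embeddings.T
        if active_experts is not None:
            # Opt-out: mask excluded experts so they receive no tokens.
            mask = torch.full_like(logits, float("-inf"))
            mask[..., active_experts] = 0.0
            logits = logits + mask
        n_avail = len(active_experts) if active_experts is not None else len(self.experts)
        k = min(self.top_k, n_avail)
        weights, indices = logits.topk(k, dim=-1)
        weights = weights.softmax(dim=-1)

        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                token_mask = indices[..., slot] == e  # tokens routed to expert e in this slot
                if token_mask.any():
                    out[token_mask] += weights[..., slot][token_mask].unsqueeze(-1) * expert(x[token_mask])
        return out


if __name__ == "__main__":
    d_model = 64
    # e.g., one public expert plus two experts trained by separate data owners
    experts = [FeedForward(d_model, 4 * d_model) for _ in range(3)]
    layer = FlexMoELayer(experts, router_embeddings=torch.randn(3, d_model), top_k=2)
    hidden = torch.randn(2, 10, d_model)
    y_full = layer(hidden)                            # all experts participate
    y_opt_out = layer(hidden, active_experts=[0, 2])  # exclude expert 1 at inference
    print(y_full.shape, y_opt_out.shape)
```

The opt-out path simply masks an expert's routing scores, so removing a data owner's contribution requires no retraining, matching the data-flexible inference described in the abstract.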
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 18204