Keywords: data mixing, language models, model development, data curation
TL;DR: Olmix makes data mixing practical for real-world LM development by 1) providing clear guidance on how to configure mixing methods and 2) proposing methods for efficiently recomputing mixtures as datasets evolve throughout LM development.
Abstract: Data mixing---determining the ratios of data from different domains---is a first-order concern for training language models (LMs), but existing mixing methods have poorly understood design choices and assume that the set of domains remains fixed throughout development. We present Olmix, a framework that addresses two challenges encountered during LM development. First, the configuration space for developing a mixing method is not well understood---design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, the domain set evolves throughout LM development as datasets are revised and expanded---a problem setting largely unaddressed by existing work. We study how to efficiently recompute the mixture after the domain set is updated, given an existing mix from before the update. We introduce mixture reuse, a mechanism that reuses existing relative ratios and recomputes ratios only for domains affected by an update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
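The mixture-reuse mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, arguments, and the assumption that newly recomputed ratios for affected domains are supplied externally are all illustrative. The core idea it shows: domains untouched by an update keep their old relative ratios, rescaled to fill the probability mass left over after the affected domains' new ratios are fixed.

```python
def reuse_mixture(old_mix, affected_new_ratios):
    """Illustrative sketch of mixture reuse (names/signature are assumptions).

    old_mix: dict mapping each domain to its ratio in the pre-update mixture.
    affected_new_ratios: dict of recomputed ratios for domains touched by the
        domain-set update (revised, added, or re-optimized domains).

    Untouched domains keep their old relative ratios, rescaled so that the
    full mixture sums to 1.
    """
    affected_mass = sum(affected_new_ratios.values())
    # Domains not affected by the update retain their existing ratios.
    untouched = {d: r for d, r in old_mix.items()
                 if d not in affected_new_ratios}
    untouched_mass = sum(untouched.values())
    # Rescale untouched ratios to fill the remaining probability mass,
    # preserving their relative proportions.
    scale = (1.0 - affected_mass) / untouched_mass
    new_mix = {d: r * scale for d, r in untouched.items()}
    new_mix.update(affected_new_ratios)
    return new_mix
```

For example, if an update revises only the code domain, the ratio for code is recomputed while web and math keep their old 2.5:1 relative ratio, rescaled to sum with the new code ratio to 1. Only the affected domains require fresh optimization, which is where the compute savings over full recomputation come from.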
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 110