Keywords: Language models, Pretraining, Data mixture
Abstract: The mixture ratio of data from different source domains significantly affects the performance of language model (LM) pretraining. In this paper, we introduce~\textsc{Domain2Vec}, a novel approach that decomposes any dataset into a linear combination of several ``Meta-Domains'', a new concept designed to capture the key underlying features of datasets. \textsc{Domain2Vec} maintains a vocabulary of Meta-Domains and uses a Meta-Domain Classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of optimal data mixture ratios for LM pretraining in a training-free manner under the \textit{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA$^{2}$). Moreover, previous methods can use \textsc{Domain2Vec} to model the relationship between domain vectors and LM performance, greatly enhancing their scalability, since no retraining is required as new datasets are introduced. Extensive experiments demonstrate that \textsc{Domain2Vec} finds data mixture ratios that enhance downstream task performance with minimal computational overhead. Specifically, \textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only $51.5\%$ of the compute required when training on the original mixture of The Pile. Under an equivalent compute budget, \textsc{Domain2Vec} improves downstream performance by an average of $2.72\%$. \textsc{Domain2Vec} serves as a strong and efficient baseline for data mixture optimization in LM pretraining, offering insights into improving data efficiency in large-scale models.
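The abstract describes a training-free selection of mixture ratios by aligning a weighted combination of source domain vectors with a target domain vector. Below is a minimal illustrative sketch of that idea; the function and variable names (`solve_mixture`, `domain_vectors`, `target_vector`) are hypothetical, and the L2 alignment objective is an assumption rather than the paper's actual formulation.

```python
# Minimal sketch of training-free mixture selection under a distribution
# alignment assumption: choose nonnegative weights summing to 1 so that the
# weighted combination of source domain vectors matches a target domain vector.
# Names and the L2 objective are illustrative assumptions, not the authors' API.

import numpy as np
from scipy.optimize import minimize


def solve_mixture(domain_vectors: np.ndarray, target_vector: np.ndarray) -> np.ndarray:
    """domain_vectors: (n_datasets, n_meta_domains), each row a distribution.
    target_vector: (n_meta_domains,) domain vector of the target/validation set.
    Returns mixture weights on the probability simplex."""
    n = domain_vectors.shape[0]

    def objective(w):
        mixed = w @ domain_vectors                     # domain vector of the mixture
        return float(np.sum((mixed - target_vector) ** 2))  # L2 alignment (assumed)

    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * n
    w0 = np.full(n, 1.0 / n)                           # start from the uniform mixture
    result = minimize(objective, w0, bounds=bounds, constraints=constraints, method="SLSQP")
    return result.x


# Toy usage: three source datasets over four meta-domains.
sources = np.array([[0.7, 0.2, 0.1, 0.0],
                    [0.1, 0.6, 0.2, 0.1],
                    [0.0, 0.1, 0.3, 0.6]])
target = np.array([0.3, 0.3, 0.2, 0.2])
print(solve_mixture(sources, target))
```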
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9648