Abstract: We introduce *Domain2Vec*, a novel approach that decomposes any dataset into a linear combination of several *meta-domains*, a new concept designed to capture the key underlying features of datasets.
*Domain2Vec* maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary.
These domain vectors enable the identification of optimal data mixtures for language model (LM) pretraining in a training-free manner under the ***D**istribution **A**lignment **A**ssumption* (DA$^{2}$), which posits that validation loss is lower when the data distributions of the training and validation sets are more closely aligned.
Moreover, *Domain2Vec* can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods.
Extensive experiments demonstrate that *Domain2Vec* helps find the data mixture that enhances downstream task performance with minimal computational overhead.
Specifically, *Domain2Vec* achieves the same validation loss on Pile-CC using only $51.5$\% of the compute required when training on the original mixture of The Pile dataset.
Under an equivalent compute budget, *Domain2Vec* improves downstream performance by an average of $2.83$\%.
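To make the DA$^{2}$ idea concrete, below is a minimal, illustrative sketch of training-free mixture selection: given precomputed domain vectors for candidate training datasets and for a validation set, it searches for simplex-constrained mixture weights that best align the mixed training distribution with the validation distribution. The function name `optimal_mixture`, the squared-L2 alignment objective, and the solver choice are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of the Distribution Alignment Assumption (DA^2), assuming:
#   - each candidate training dataset i has a precomputed domain vector
#     (a distribution over K meta-domains; row i of V), and
#   - the validation set has a domain vector u.
# The paper's actual distance measure and optimizer may differ; this uses a
# generic simplex-constrained least-squares fit purely as an illustration.
import numpy as np
from scipy.optimize import minimize

def optimal_mixture(V: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Find mixture weights w on the simplex such that w @ V ~ u."""
    n = V.shape[0]
    w0 = np.full(n, 1.0 / n)                        # start from a uniform mixture
    objective = lambda w: np.sum((w @ V - u) ** 2)  # squared L2 alignment gap
    constraints = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
    bounds = [(0.0, 1.0)] * n
    result = minimize(objective, w0, bounds=bounds, constraints=constraints)
    return result.x

# Toy example: 3 candidate datasets decomposed over 4 meta-domains.
V = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])
u = np.array([0.3, 0.3, 0.2, 0.2])                  # validation domain vector
print(optimal_mixture(V, u))                         # mixture weights, sum to 1
```

No model training is involved in this step; the only inputs are the domain vectors themselves, which is what makes the mixture search training-free.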
Lay Summary: Finding the optimal data mixture for LLM pretraining is crucial yet challenging due to the high computational costs involved.
We propose *Domain2Vec*, a novel approach that employs domain vectors to represent each dataset in LLM pretraining. With *Domain2Vec*, the mixture of pretraining datasets can be elegantly transformed into a linear combination of the predefined meta-domains. Under our proposed ***D**istribution **A**lignment **A**ssumption* (DA$^{2}$), these computed domain vectors enable us to identify optimal data mixtures without training. Furthermore, we demonstrate how *Domain2Vec* can be seamlessly integrated into existing data mixture optimization frameworks to enhance both efficiency and scalability.
Our experiments show that *Domain2Vec* achieves comparable performance to existing approaches while significantly reducing computational overhead. We believe this work offers valuable insights into optimizing data mixtures for LLM pretraining and paves the way for more efficient training strategies.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Language models, Pretraining, Data mixture
Submission Number: 7125