BiMix: Bivariate Data Mixing Law for Language Model Pretraining

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Data Mixture, Large Language Models, Scaling Law
TL;DR: We introduce a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining.
Abstract: Large language models have demonstrated remarkable capabilities across various tasks, largely attributable to their training on diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces BiMix, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. BiMix provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate BiMix's high accuracy in loss extrapolation (mean relative error $< 0.2\%$) and its generalization to unseen mixtures (R$^{2} > 0.97$). Optimizing domain proportions with BiMix yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient, computationally lightweight proxies for data mixing. Our work contributes both theoretical insights into data mixing dynamics and practical tools for improving LLM training efficiency, paving the way for more effective scaling strategies in language model development.
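To make the idea of a bivariate mixing law concrete, the sketch below fits a generic two-variable power law of per-domain validation loss in the domain proportion r and the data volume s, then extrapolates to a larger data volume. This is only an illustrative stand-in: the functional form `L(r, s) = A / (r**alpha * s**beta) + C`, the synthetic observations, and all coefficient values are assumptions for demonstration and are not the exact law, data, or results reported in the paper.

```python
# Illustrative sketch (NOT the paper's exact functional form or data): fitting a
# generic bivariate power law L(r, s) = A / (r**alpha * s**beta) + C to
# per-domain validation losses observed at several mixture proportions r and
# data volumes s, then extrapolating the loss to a larger data volume.
import numpy as np
from scipy.optimize import curve_fit

def bivariate_loss(X, A, alpha, beta, C):
    """Generic bivariate scaling form; a placeholder, not BiMix's fitted law."""
    r, s = X  # r: domain proportion in the mixture, s: training tokens
    return A / (r**alpha * s**beta) + C

# Synthetic observations for one domain (stand-ins for measured losses).
r_obs = np.array([0.05, 0.10, 0.20, 0.40, 0.05, 0.10, 0.20, 0.40])
s_obs = np.array([1e9, 1e9, 1e9, 1e9, 4e9, 4e9, 4e9, 4e9])
loss_obs = bivariate_loss((r_obs, s_obs), A=2.0, alpha=0.3, beta=0.2, C=1.5)
loss_obs += np.random.default_rng(0).normal(scale=0.01, size=loss_obs.shape)

# Fit the four coefficients, then extrapolate to a larger data volume.
params, _ = curve_fit(bivariate_loss, (r_obs, s_obs), loss_obs,
                      p0=[1.0, 0.5, 0.5, 1.0], maxfev=10_000)
predicted = bivariate_loss((np.array([0.20]), np.array([1.6e10])), *params)
print("fitted (A, alpha, beta, C):", params)
print("extrapolated loss at r=0.20, s=1.6e10 tokens:", predicted[0])
```

The abstract additionally reports that entropy-based measures of the corpus can serve as lightweight proxies when choosing mixtures; the fit above illustrates only the loss-extrapolation component of such a workflow.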
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6713