Abstract: The pre-training data mixture is pivotal for large language model (LLM) performance, but searching for the best domain weights is computationally expensive. We present Domain Weight Randomization with Bayesian Updating (DRAW), a principled framework that treats domain weights as Dirichlet-distributed random variables whose parameters scale with model width. Informative priors are first estimated with proxy models; the main model then refines them through Bayesian inference and parameter scaling, dynamically sampling domain weights during training. Theoretically, DRAW reduces generalization error at a rate of $\mathcal{O}(1/\sqrt{n})$ as model width increases, ensuring stable convergence. Empirical results on open-domain corpora and diverse benchmarks show that DRAW reliably outperforms fixed and adaptive baselines on both language modeling and downstream tasks, achieving better average and worst-case performance along with strong robustness. DRAW not only highlights valuable data domains while suppressing noisy ones, but also provides a scalable and effective mechanism for adaptive data mixing in LLM pre-training, enabling efficient knowledge transfer from proxy models to large models.
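The abstract does not spell out the exact parameterization, so the following is a minimal sketch of the general idea, assuming hypothetical helpers `proxy_prior` and `bayesian_update`, a width-proportional scaling of the Dirichlet concentration, and multinomial counts as a stand-in for the per-domain evidence the main model would actually collect; none of these choices are taken from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_prior(proxy_domain_scores, width, base_width, strength=10.0):
    """Turn proxy-model per-domain scores into Dirichlet concentration
    parameters, scaled with the main model's width (illustrative scaling only)."""
    probs = np.exp(proxy_domain_scores) / np.exp(proxy_domain_scores).sum()
    return strength * (width / base_width) * probs

def bayesian_update(alpha, domain_counts):
    """Conjugate Dirichlet update: add observed per-domain counts
    (a placeholder for evidence of domain usefulness) to the prior."""
    return alpha + domain_counts

# Hypothetical setup: 4 domains, proxy scores, main model 8x wider than the proxy.
alpha = proxy_prior(np.array([1.2, 0.4, 0.9, -0.3]), width=4096, base_width=512)

for step in range(3):
    # Sample a domain-weight mixture for this training step.
    w = rng.dirichlet(alpha)
    # ... train on a batch mixed according to w, then gather per-domain evidence ...
    evidence = rng.multinomial(64, w)          # placeholder for observed evidence
    alpha = bayesian_update(alpha, evidence)   # refine the posterior concentration
    print(step, np.round(w, 3))
```

Because the Dirichlet is conjugate to the multinomial, the update is a simple addition of counts to the concentration vector, which is what makes resampling the mixture cheap at every training step.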
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Marwa_El_Halabi1
Submission Number: 6661