DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training

Published: 09 Apr 2026 · Last Modified: 21 Apr 2026 · Accepted by TMLR · License: CC BY 4.0
Abstract: An optimal pre-training data mixture is pivotal for large language model (LLM) performance, but searching for the best domain weights is computationally expensive. We present Domain Weight Randomization with Bayesian Updating (DRAW), a principled framework that treats domain weights as Dirichlet-distributed random variables whose concentration parameters scale with model width. Informative priors are first estimated with proxy models; the main model then refines them via Bayesian inference and parameter scaling, sampling domain weights dynamically during training. Theoretically, DRAW reduces generalization error at a rate $\mathcal{O}(1/\sqrt{n})$ as model width increases, ensuring stable convergence. Empirically, on open-domain corpora and diverse benchmarks, DRAW reliably outperforms fixed and adaptive baselines in both language modeling and downstream tasks, achieving better average and worst-case performance alongside strong robustness. DRAW not only highlights valuable data domains while suppressing noisy ones, but also provides a scalable and effective mechanism for adaptive data mixing in LLM pre-training, enabling efficient knowledge transfer from proxy models to large models.
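The abstract's sampling step, as described, draws per-step domain weights from a Dirichlet whose mean comes from proxy models and whose concentration grows with model width. A minimal sketch of that idea, assuming hypothetical names, widths, and a simple linear scaling rule (none of these are the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical proxy-informed mean mixture over 4 data domains
# (e.g. estimated from small proxy-model runs; values illustrative).
proxy_mean = np.array([0.4, 0.3, 0.2, 0.1])

def draw_weights(proxy_mean, width, base_width=768, base_conc=10.0):
    """Sample domain weights from a Dirichlet whose concentration
    scales with model width (sketch, not the paper's code)."""
    conc = base_conc * (width / base_width)   # larger model -> sharper prior
    alpha = conc * proxy_mean                 # same mean, lower variance
    return rng.dirichlet(alpha)

w_small = draw_weights(proxy_mean, width=768)    # proxy-scale model
w_large = draw_weights(proxy_mean, width=4096)   # main model
```

During training, one such draw would set the sampling proportions for the next batch of pre-training data; both draws sum to one and share the same proxy-informed mean, but the wider model's draws concentrate more tightly around it.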
Submission Type: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Changes Since Last Submission: EiC revision: added the corresponding author information and funding acknowledgment to the paper. --- In the revised manuscript we:
- clarified the scope of our theoretical claims and softened several overly strong statements;
- added methodological clarifications in Section 3.1 on (i) why scaling the Dirichlet concentration for larger models yields a sharper, lower-variance distribution around the same proxy-informed mean, and (ii) how DRAW relates to Group DRO;
- added a discussion in Section 2 clarifying how DRAW differs from online dynamic refinement methods such as Jiang et al.;
- revised the presentation of Theorems 1 and 2 to better explain the assumptions under which they hold, the intuition they provide, and the limits of their implications for real-world LLM pre-training;
- added a dynamic-vs.-static reweighting comparison in Section 4.2, showing that dynamic DRAW outperforms a static proxy-informed variant (3.5002 vs. 3.6415 average validation loss in the 70M → 150M setting);
- expanded the discussion of downstream results that are close to chance level, clarified training-step consistency, and strengthened the limitations discussion in Section 5.2, including a clearer account of the method's assumptions and the fact that the current Dirichlet formulation does not explicitly model structured cross-domain interactions;
- revised the wording throughout to better align our claims with the theoretical and empirical evidence.
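The concentration-scaling point in (i) follows from the Dirichlet variance formula: for $\alpha = c \cdot m$ with mixture mean $m$, each component has $\mathrm{Var}(w_i) = m_i(1 - m_i)/(c + 1)$, so increasing $c$ leaves the mean unchanged while shrinking the variance. A small numerical check of this claim (the mean vector and concentrations are illustrative, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
mean = np.array([0.5, 0.3, 0.2])          # same proxy-informed mean throughout

def mixture_stats(concentration, n=20000):
    """Empirical mean/variance of samples from Dirichlet(concentration * mean)."""
    samples = rng.dirichlet(concentration * mean, size=n)
    return samples.mean(axis=0), samples.var(axis=0)

m_small, v_small = mixture_stats(10.0)    # small (proxy-scale) concentration
m_large, v_large = mixture_stats(100.0)   # larger, width-scaled concentration
# Both recover the same mean; the variance shrinks by roughly (10+1)/(100+1),
# matching Var(w_i) = m_i * (1 - m_i) / (concentration + 1).
```

This is why, in the revised Section 3.1's framing, larger models sample domain weights that stay close to the proxy-informed mixture while smaller models explore more broadly.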
Assigned Action Editor: ~Marwa_El_Halabi1
Submission Number: 6661