R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Submitted: 13 Sept 2025 (modified: 11 Feb 2026) · ICLR 2026 · CC BY 4.0
Keywords: data mixing, language models, multimodal models, compute efficiency, data efficiency
TL;DR: We introduce a two-stage framework for enhancing model training on diverse data, first by clustering data then optimizing domain weights.
Abstract: While data mixing strategies have successfully reduced training costs, existing methods suffer from two critical flaws: they rely on predetermined data domains that may fail to capture semantic nuances, and they scale computationally with the number of domains in a prohibitive way. We address these challenges by paying a fixed one-time cost to repartition source data into semantically similar domains and reusing training gradients to estimate domain importance. We propose **R&B**, a two-stage framework that re-partitions training data based on semantic similarity (**Regroup**) to create finer-grained domains, then efficiently optimizes the data composition (**Balance**) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, **R&B** removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify **R&B**'s effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of **R&B** on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01\% additional compute overhead, **R&B** matches or exceeds the performance of state-of-the-art data mixing strategies.
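The abstract sketches a two-stage pipeline: cluster training examples into finer-grained semantic domains (**Regroup**), then set mixture weights from a Gram matrix of per-domain gradients already produced during training (**Balance**). A minimal illustrative sketch of that idea is below; it is not the authors' implementation, and the specific choices (k-means over embeddings, mean-alignment scoring, softmax weighting, and the `temperature` parameter) are assumptions for illustration only.

```python
# Hypothetical sketch of the R&B two-stage idea; not the paper's actual method.
import numpy as np

def regroup(embeddings, k, iters=20, seed=0):
    """Stage 1 (Regroup): repartition examples into k semantic domains
    via simple k-means (Lloyd's algorithm) on their embeddings."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = embeddings[labels == j].mean(axis=0)
    return labels

def balance(domain_grads, temperature=1.0):
    """Stage 2 (Balance): form the Gram matrix of per-domain gradients
    (reused from training, so no extra forward/backward passes) and
    upweight domains whose gradients align with the others on average.
    The softmax/temperature weighting here is an illustrative choice."""
    gram = domain_grads @ domain_grads.T      # k x k Gram matrix
    scores = gram.mean(axis=1)                # average alignment per domain
    w = np.exp(scores / temperature)
    return w / w.sum()                        # mixture weights summing to 1
```

The key efficiency point from the abstract survives even in this toy version: `balance` consumes gradients the training loop already computed, so the only added cost is the small Gram-matrix algebra, which scales with the number of domains rather than with dataset size.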
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4916