R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

Submitted: 13 Sept 2025 (modified: 11 Feb 2026) · ICLR 2026 · CC BY 4.0
Keywords: data mixing, language models, multimodal models, compute efficiency, data efficiency
TL;DR: We introduce a two-stage framework for enhancing model training on diverse data, first by clustering data then optimizing domain weights.
Abstract: While data mixing strategies have successfully reduced training costs, existing methods suffer from two critical flaws: they rely on predetermined data domains that may fail to capture semantic nuances, and they scale computationally with the number of domains in a prohibitive way. We address these challenges by paying a fixed one-time cost to repartition source data into semantically similar domains and reusing training gradients to estimate domain importance. We propose **R&B**, a two-stage framework that re-partitions training data based on semantic similarity (**Regroup**) to create finer-grained domains, then efficiently optimizes the data composition (**Balance**) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, **R&B** removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify **R&B**'s effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of **R&B** on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01\% additional compute overhead, **R&B** matches or exceeds the performance of state-of-the-art data mixing strategies.
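The abstract sketches a two-stage pipeline: cluster training examples into finer-grained semantic domains (**Regroup**), then set mixture weights from a Gram matrix of per-domain gradients already produced during training (**Balance**). A minimal illustrative sketch of that idea is below; it is not the authors' implementation, and the specific choices (k-means over embeddings, mean-alignment scoring, softmax weighting, and the `temperature` parameter) are assumptions for illustration only.

```python
# Hypothetical sketch of the R&B two-stage idea; not the paper's actual method.
import numpy as np

def regroup(embeddings, k, iters=20, seed=0):
    """Stage 1 (Regroup): repartition examples into k semantic domains
    via simple k-means (Lloyd's algorithm) on their embeddings."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = embeddings[labels == j].mean(axis=0)
    return labels

def balance(domain_grads, temperature=1.0):
    """Stage 2 (Balance): form the Gram matrix of per-domain gradients
    (reused from training, so no extra forward/backward passes) and
    upweight domains whose gradients align with the others on average.
    The softmax/temperature weighting here is an illustrative choice."""
    gram = domain_grads @ domain_grads.T      # k x k Gram matrix
    scores = gram.mean(axis=1)                # average alignment per domain
    w = np.exp(scores / temperature)
    return w / w.sum()                        # mixture weights summing to 1
```

The key efficiency point from the abstract survives even in this toy version: `balance` consumes gradients the training loop already computed, so the only added cost is the small Gram-matrix algebra, which scales with the number of domains rather than with dataset size.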
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4916