Keywords: Large Language Models, Data Diversity, Data Quality, Synthetic Data, LLM Reasoning
TL;DR: We present G-Vendi, a data diversity measure that strongly correlates with LLM reasoning generalization on OOD benchmarks; we use this insight to generate diverse synthetic reasoning data, which leads to SOTA distilled models in NLI and math reasoning.
Abstract: Data diversity is crucial for training a strong language model. Yet metrics of diversity often diverge from this goal, measuring variations in heuristic features—like n-grams or embeddings—that are detached from how the model actually performs on a target task. This motivates us to ask: *Can we redefine data diversity—beyond measuring variations in heuristic features—in a way that better predicts model generalization?* Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning—as measured by average model performance on unseen out-of-distribution (OOD) benchmarks. We introduce **G-Vendi**, a metric that quantifies diversity via the entropy of model-induced loss gradients. G-Vendi scales to million-sample datasets and yet consistently outperforms heuristic alternatives, achieving strong correlation ($\text{Spearman's } \rho \approx 0.9$) with OOD performance across both natural language inference (NLI) and math reasoning tasks. Building on this insight, we present **Prismatic Synthesis**, a framework for generating diverse synthetic data by targeting underrepresented regions in gradient space. Experimental results show that Prismatic Synthesis consistently improves model performance as we scale synthetic data—not just on in-distribution test sets but across unseen, OOD benchmarks—significantly outperforming state-of-the-art models in both domains. For example, PrismMath-7B, our model distilled from a 32B LLM without human verification, outperforms R1-Distill-Qwen-7B—trained on proprietary data generated by the 671B R1—on 6 out of 7 challenging math benchmarks.
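To make the core idea concrete, below is a minimal illustrative sketch (not the authors' released implementation) of a Vendi-score-style diversity estimate computed from per-example loss gradients of a probe model, matching the abstract's description of "entropy of model-induced loss gradients." The helper names (`per_example_gradient`, `probe_model`) and the choice of a cosine-similarity kernel are assumptions for illustration.

```python
# Illustrative sketch of a gradient-based Vendi-style diversity score.
# Assumptions: gradients come from a small probe model; similarity is cosine.
import numpy as np
import torch


def per_example_gradient(probe_model, loss_fn, x, y):
    """Hypothetical helper: flattened loss gradient for a single example."""
    probe_model.zero_grad()
    loss = loss_fn(probe_model(x), y)
    grads = torch.autograd.grad(
        loss, [p for p in probe_model.parameters() if p.requires_grad]
    )
    return torch.cat([g.reshape(-1) for g in grads]).detach().cpu().numpy()


def gradient_vendi_score(grad_features: np.ndarray) -> float:
    """Exponential of the Shannon entropy of normalized kernel eigenvalues.

    grad_features: (n_samples, d) matrix of per-example gradient features.
    Returns a value in [1, n]: 1 if all gradients point the same way,
    n if they are mutually orthogonal.
    """
    # L2-normalize rows so the cosine-similarity kernel has unit diagonal.
    feats = grad_features / (
        np.linalg.norm(grad_features, axis=1, keepdims=True) + 1e-12
    )
    n = feats.shape[0]
    kernel = feats @ feats.T                   # (n, n) cosine similarities
    eigvals = np.linalg.eigvalsh(kernel / n)   # eigenvalues sum to 1
    eigvals = np.clip(eigvals, 1e-12, None)
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

In this reading, a dataset whose examples induce many distinct gradient directions in the probe model scores high, which is the property the abstract ties to OOD generalization.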
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 23891