Open LLM Projects Should Allocate More Compute for Data Than Training

Published: 02 Mar 2026 · Last Modified: 02 Apr 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: large language models, open LLM, data-centric AI, compute allocation, synthetic data, pretraining, token efficiency
TL;DR: Open LLM projects should spend most of their compute on data curation and synthetic generation, not training, because data investment yields 6-9x efficiency gains.
Abstract: Open LLM projects aim to build the best possible open language models under constrained compute budgets. Currently, most allocate the vast majority of their GPU compute to training runs rather than to improving their data. This position paper argues that these efforts should invest the majority of their compute in data, not training. Reported efficiency gains of 6-9x from data curation, filtering, and synthetic generation justify allocating 80% or more of development compute to data work. Beyond producing better models, data investments compound across model generations, whereas individual models are often superseded within months. We discuss allocation strategies and call for open LLM projects to adopt explicitly data-centric compute accounting.
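To make the abstract's arithmetic concrete, the sketch below works through a toy version of the allocation claim: if data work multiplies the token efficiency of the remaining training compute, a data-heavy 80/20 split can beat a training-heavy split whenever the multiplier is large enough. Only the 6-9x gains and the 80% figure come from the abstract; the function name, the 5% baseline data fraction, and the normalized budget are illustrative assumptions, not figures from the paper.

```python
# Toy model of the compute-allocation claim. All names and baseline numbers
# here are illustrative assumptions; only the 6-9x gains and the 80% data
# share are taken from the abstract.

def effective_training_compute(total_budget: float,
                               data_fraction: float,
                               data_efficiency_gain: float) -> float:
    """Effective compute under a simple model: spending `data_fraction` of
    the budget on data work multiplies the value of the remaining training
    compute by `data_efficiency_gain` (token efficiency from better data)."""
    training_compute = total_budget * (1.0 - data_fraction)
    return training_compute * data_efficiency_gain

budget = 1.0  # normalized GPU-hour budget

# Baseline: nearly all compute goes to training, no data-quality multiplier.
baseline = effective_training_compute(budget, data_fraction=0.05,
                                      data_efficiency_gain=1.0)

# Data-centric split: 80% on curation/synthesis, assuming the reported gains.
for gain in (6.0, 9.0):
    data_centric = effective_training_compute(budget, data_fraction=0.8,
                                              data_efficiency_gain=gain)
    print(f"gain={gain:.0f}x: data-centric / baseline = "
          f"{data_centric / baseline:.2f}")
```

Under these assumptions, the data-centric split yields roughly 1.3x (at a 6x gain) to 1.9x (at a 9x gain) the baseline's effective training compute, despite training on only 20% of the budget; this illustrates the direction of the abstract's claim, though the paper's actual accounting may differ.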
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 16