Data Efficient Pre-training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence
Keywords: Large Language Models, Dataset Selection, Compute Efficiency, Linguistic Competence, Resource-Constrained
TL;DR: We present a pipeline to evaluate language model pre-training on small datasets, finding that BabyLM-trained models show stronger formal linguistic competence with lower variance, while TinyStories may boost functional competence and aid curriculum learning.
Abstract: Training large language models is compute- and data-intensive, which limits hyperparameter optimisation and training in low-resource settings, and increases environmental impact.
This paper examines the pre-training effectiveness of language models of different sizes on two small, curated datasets and evaluates (i) linguistic competence and (ii) compute efficiency.
The datasets are TinyStories, a collection of ChatGPT-generated children's stories, and BabyLM, a small, open-domain dataset. We perform experiments with increasing amounts of data (yielding a learning curve) and with size variants of a Llama-based, decoder-only architecture. We evaluate the pre-trained models on downstream tasks from the BLiMP and GLUE benchmark suites.
We find that models trained on BabyLM outperform those trained on TinyStories on formal linguistic competence, but not on functional linguistic tasks. Models pre-trained on BabyLM also yield more consistent results, as indicated by lower variance across random seeds. In addition, performance after training on small data subsets is predictive of a model's final performance, which can aid the early selection of promising candidate models.
These findings emphasise the potential of small, curated datasets for data-efficient pre-training in resource-constrained settings. Further work that includes additional datasets and model architectures is needed to extend the scope of these findings.
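As a rough illustration of the setup described in the abstract (not the authors' code), the sketch below shows how size variants of a Llama-style decoder-only model might be instantiated and paired with increasing data fractions to trace a learning curve; the model dimensions, vocabulary size, and data fractions are illustrative assumptions, and the training and evaluation steps are left as placeholders.

```python
# Illustrative sketch only -- not the authors' pipeline. Model dimensions,
# vocabulary size, and data fractions are assumptions for demonstration.
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical size variants of a Llama-based, decoder-only architecture.
size_variants = {
    "tiny":  dict(hidden_size=256, intermediate_size=1024,
                  num_hidden_layers=4, num_attention_heads=4),
    "small": dict(hidden_size=512, intermediate_size=2048,
                  num_hidden_layers=8, num_attention_heads=8),
}

# Increasing data fractions trace out a learning curve per dataset.
data_fractions = [0.01, 0.1, 0.25, 0.5, 1.0]

for name, dims in size_variants.items():
    config = LlamaConfig(vocab_size=16_000, **dims)
    model = LlamaForCausalLM(config)
    print(f"{name}: {model.num_parameters() / 1e6:.1f}M parameters")
    for frac in data_fractions:
        # Placeholders: sample `frac` of TinyStories or BabyLM, pre-train the
        # model, then evaluate on BLiMP (formal) and GLUE (functional) tasks.
        pass
```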
Submission Number: 36