Abstract: In this paper, we present an alternative approach to language model training that emphasizes data quality over sheer volume. Using the authoritative Encyclopaedia Britannica as our corpus, we first restrict training to a 10-million-word subset and then expand to the complete 38-million-word corpus to train a specialized model for encyclopedic content generation. Central to our approach is knowledge distillation, which allows us to train compact student models guided by larger teacher models, achieving high performance while significantly reducing model complexity. Building on the BabyLlama architecture, we find that high-quality, curated data combined with effective distillation techniques can facilitate efficient and effective learning. This work highlights promising directions for resource-constrained applications and specialized domain modeling. We will release our code and models if this paper is accepted.
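The abstract's teacher-student setup can be illustrated with a minimal knowledge-distillation loss. The sketch below is not the paper's reported configuration: the function name, the temperature, and the mixing weight `alpha` are illustrative assumptions, and it shows a single teacher rather than any ensemble the authors may have used.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KL term.

    `temperature` and `alpha` are placeholder hyperparameters, not
    values reported in the paper.
    """
    vocab_size = student_logits.size(-1)
    # Standard next-token cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits.view(-1, vocab_size),
                         labels.view(-1))
    # KL divergence between temperature-softened student and teacher
    # distributions, scaled by T^2 as in standard distillation practice.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```

In this formulation the student is pushed toward both the ground-truth tokens and the teacher's softened output distribution, which is what lets a compact student approach the teacher's quality on the curated corpus.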
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: distillation, parameter-efficient-training
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 1870